uwsgi listen queue fills on reload - django

I'm running a Django app on uwsgi with an average of 110 concurrent users and 5 requests per second during peak hours. When I deploy with uwsgi reload during these peak hours, I run into an issue where workers keep getting slowly killed and restarted, and the uwsgi logs show the following:
Gracefully killing worker 1 (pid: 25145)...
Gracefully killing worker 2 (pid: 25147)...
... a few minutes go by ...
worker 2 killed successfully (pid: 25147)
Respawned uWSGI worker 2 (new pid: 727)
... a few minutes go by ...
worker 2 killed successfully (pid: 727)
Respawned uWSGI worker 2 (new pid: 896)
... this continues gradually for 25 minutes until:
*** listen queue of socket "127.0.0.1:8001" (fd: 3) full !!! (101/100) ***
At this point my app rapidly slows to a crawl and I can only recover with a hard uwsgi stop followed by a uwsgi start. There are some relevant details which make this situation kind of peculiar:
This only occurs when I uwsgi reload, otherwise the listen queue never fills up on its own
The error messages and slowdown only start to occur about 25 minutes after the reload
Even during the moment of crisis, memory and CPU resources on the machine seem fine
If I deploy during lighter traffic times, this issue does not seem to pop up
I realize that I can increase the listen queue size, but that seems like a band-aid more than an actual solution. And the fact that it only fills up during reload (and takes 25 minutes to do so) leads me to believe that it will fill up eventually regardless of the size. I would like to figure out the mechanism that is causing the queue to fill up and address that at the source.
Relevant uwsgi config:
[uwsgi]
socket = 127.0.0.1:8001
processes = 4
threads = 2
max-requests = 300
reload-on-rss = 800
vacuum = True
touch-reload = foo/uwsgi/reload.txt
memory-report = true
Relevant software version numbers:
uwsgi 2.0.14
Ubuntu 14.04.1
Django 1.11.13
Python 2.7.6
It appears that our touch reload is not graceful while we are serving traffic. Is this to be expected, or do we have a more fundamental issue?

uWSGI has a harakiri mode that kills any worker whose request runs longer than a configured timeout, to prevent unreliable code from hanging (and effectively taking down the app). I would suggest looking there for why your processes are being killed.
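For reference, harakiri is enabled with a single option in the [uwsgi] block; the 30-second timeout below is only an example value, not taken from the question:
# kill any worker whose single request runs longer than 30 seconds
harakiri = 30
# log extra detail whenever a worker is reaped this way
harakiri-verbose = true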
As to why a hard stop works and a graceful stop does not: that further suggests your application code is hanging. In uWSGI, a graceful stop or reload sends SIGHUP, which waits for workers to finish their current requests and clean up resources; SIGINT and SIGTERM follow the harsher guideline of "stop what you are doing right now and exit".
Anyway, it boils down to this not being a uwsgi issue, but an issue in your application code. Find what is hanging and why. Since you are not noticing CPU spikes, some probable places to look are:
blocking connections
locks
a long sleep
Good luck!

The key thing to look at is "listen queue of socket "127.0.0.1:8001" (fd: 3) full !!! (101/100)".
The default listen queue size is 100. Increase the queue size by adding the listen option to uwsgi.ini:
listen = 4096

Related

uWSGI listen queue of socket full

My setup includes a load balancer (haproxy) with two nginx servers running Django. Server 2 works fine, but sometimes server 1 starts crashing and the log fills with the message:
*** uWSGI listen queue of socket ":8000" (fd: 3) full !!! (101/100) ***
How do I go about resolving this issue?
Your listen queue is full. When you run uwsgi, pass it --listen 1024 to increase the queue to 1024.
Note that a larger queue makes you more susceptible to a DDoS attack.
You may also need to increase net.core.somaxconn, since the kernel caps listen backlogs at that value:
sysctl -w net.core.somaxconn=65536

uWSGI downtime when restarting

I have a problem with uwsgi every time I restart the server after a code update.
When I restart uwsgi using "sudo restart accounting", there's a small gap between stopping and starting the instance that results in downtime and kills all in-flight requests.
When I try "sudo reload accounting", it works, but my memory usage doubles. When I run "ps aux | grep accounting", it shows that I have 10 running processes (accounting.ini) instead of 5, and it freezes up my server when the memory hits the limit.
accounting.ini
I am running
Ubuntu 14.04
Django 1.9
nginx 1.4.6
uwsgi 2.0.12
This is how uwsgi does a graceful reload: it keeps the old processes around until their pending requests are served, and spawns new ones that take over incoming requests.
Read "Things that could go wrong" in the uWSGI docs:
Do not forget, your workers/threads that are still running requests
could block the reload (for various reasons) for more seconds than
your proxy server could tolerate.
And this
Another important step of graceful reload is to avoid destroying
workers/threads that are still managing requests. Obviously requests
could be stuck, so you should have a timeout for running workers (in
uWSGI it is called the “worker’s mercy” and it has a default value of
60 seconds).
So I would recommend trying worker-reload-mercy.
The default is to wait 60 seconds; lower it to something your server can handle.
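For example, a sketch of what that looks like in the [uwsgi] block; 30 seconds is only an illustrative value, so pick something longer than your slowest legitimate request:
# give each old worker at most 30 seconds to finish its requests on reload
worker-reload-mercy = 30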
Tell me if it worked.
uWSGI chain reload
This is another attempt to fix your issue. As you mentioned, your uwsgi workers are restarting in the manner described below (the default graceful reload):
Send a SIGHUP signal to the master.
Wait for running workers.
Close all of the file descriptors except the ones mapped to sockets.
Call exec() on itself.
One of the cons of this kind of reload might be stuck workers.
Additionally, you report that your server crashes when uwsgi maintains 10 processes (5 old and 5 new ones).
I propose trying chain reload. A direct quote from the documentation explains this kind of reload best:
When triggered, it will restart one worker at time, and the following worker is not reloaded until the previous one is ready to accept new requests.
It means that you will not have 10 processes on your server but only 5.
Config that should work:
# your .ini file
lazy-apps = true
touch-chain-reload = /path/to/reloadFile
Some resources on chain reload and other kinds are in links below:
Chain reloading uwsgi docs
uWSGI graceful Python code deploy

Celery and RabbitMQ timeouts and connection resets

I'm using RabbitMQ 3.6.0 and Celery 3.1.20 on a Windows 10 machine in a Django application. Everything is running on the same computer. I've configured Celery to Acknowledge Late (CELERY_ACKS_LATE=True) and now I'm getting connection problems.
I start the Celery worker, and after 50-60 seconds of handling tasks each worker thread fails with the following message:
Couldn't ack ###, reason:ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None)
(### is the number of the task)
When I look at the RabbitMQ logs I see this:
=INFO REPORT==== 10-Feb-2016::22:16:16 ===
accepting AMQP connection <0.247.0> (127.0.0.1:55372 -> 127.0.0.1:5672)
=INFO REPORT==== 10-Feb-2016::22:16:16 ===
accepting AMQP connection <0.254.0> (127.0.0.1:55373 -> 127.0.0.1:5672)
=ERROR REPORT==== 10-Feb-2016::22:17:14 ===
closing AMQP connection <0.247.0> (127.0.0.1:55372 -> 127.0.0.1:5672):
{writer,send_failed,{error,timeout}}
The error occurs exactly when the Celery workers are getting their connection reset.
I thought this was an AMQP Heartbeat issue, so I've added BROKER_HEARTBEAT = 15 to my Celery settings, but it did not make any difference.
I was having a similar issue with Celery on Windows with long-running tasks and concurrency=1. The following configuration finally worked for me:
CELERY_ACKS_LATE = True
CELERYD_PREFETCH_MULTIPLIER = 1
I also started the celery worker daemon with the -Ofair option:
celery -A test worker -l info -Ofair
In my limited understanding, CELERYD_PREFETCH_MULTIPLIER sets the number of messages each Celery worker prefetches from the queue at a time. By default it is set to 4. If you set it to 1, each worker will only consume one message and complete the task before it consumes another message. I was having issues with long-running tasks because the connection to RabbitMQ was consistently lost in the middle of the long task, but then the task was re-attempted if any other messages/tasks were waiting in the celery queue.
The following option was also specific to my situation:
CELERYD_CONCURRENCY = 1
Setting concurrency to 1 made sense for me because I had long-running tasks that needed a large amount of RAM, so they each needed to run solo.
bbaker's solution with CELERY_ACKS_LATE (which is task_acks_late in Celery 4.x) did not work for me by itself. My workers are in Kubernetes pods and must be run with --pool solo, and each task takes 30-60s.
I solved it by including broker_heartbeat = 0; my full set of options is:
broker_pool_limit = None
task_acks_late = True
broker_heartbeat = 0
worker_prefetch_multiplier = 1
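For completeness, a minimal sketch of applying these options in Celery 4.x code; the app name "myapp" is a placeholder, not something from the original answer:
from celery import Celery

app = Celery("myapp")
app.conf.update(
    broker_pool_limit=None,        # do not cap the broker connection pool
    task_acks_late=True,           # acknowledge only after the task finishes
    broker_heartbeat=0,            # disable AMQP heartbeats entirely
    worker_prefetch_multiplier=1,  # prefetch a single message per worker
)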

Celery: WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL)

I use Celery with RabbitMQ in my Django app (on Elastic Beanstalk) to manage background tasks and I daemonized it using Supervisor.
The problem now is that one of the periodic tasks I defined is failing (after a week in which it worked properly). The error I get is:
[01/Apr/2014 23:04:03] [ERROR] [celery.worker.job:272] Task clean-dead-sessions[1bfb5a0a-7914-4623-8b5b-35fc68443d2e] raised unexpected: WorkerLostError('Worker exited prematurely: signal 9 (SIGKILL).',)
Traceback (most recent call last):
File "/opt/python/run/venv/lib/python2.7/site-packages/billiard/pool.py", line 1168, in mark_as_worker_lost
human_status(exitcode)),
WorkerLostError: Worker exited prematurely: signal 9 (SIGKILL).
All the processes managed by supervisor are up and running properly (supervisorctl status says RUNNING).
I tried to read several logs on my ec2 instance, but none of them seem to help me find out what is causing the SIGKILL. What should I do? How can I investigate?
These are my celery settings:
CELERY_TIMEZONE = 'UTC'
CELERY_TASK_SERIALIZER = 'json'
CELERY_ACCEPT_CONTENT = ['json']
BROKER_URL = os.environ['RABBITMQ_URL']
CELERY_IGNORE_RESULT = True
CELERY_DISABLE_RATE_LIMITS = False
CELERYD_HIJACK_ROOT_LOGGER = False
And this is my supervisord.conf:
[program:celery_worker]
environment=$env_variables
directory=/opt/python/current/app
command=/opt/python/run/venv/bin/celery worker -A com.cygora -l info --pidfile=/opt/python/run/celery_worker.pid
startsecs=10
stopwaitsecs=60
stopasgroup=true
killasgroup=true
autostart=true
autorestart=true
stdout_logfile=/opt/python/log/celery_worker.stdout.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stderr_logfile=/opt/python/log/celery_worker.stderr.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
numprocs=1
[program:celery_beat]
environment=$env_variables
directory=/opt/python/current/app
command=/opt/python/run/venv/bin/celery beat -A com.cygora -l info --pidfile=/opt/python/run/celery_beat.pid --schedule=/opt/python/run/celery_beat_schedule
startsecs=10
stopwaitsecs=300
stopasgroup=true
killasgroup=true
autostart=false
autorestart=true
stdout_logfile=/opt/python/log/celery_beat.stdout.log
stdout_logfile_maxbytes=5MB
stdout_logfile_backups=10
stderr_logfile=/opt/python/log/celery_beat.stderr.log
stderr_logfile_maxbytes=5MB
stderr_logfile_backups=10
numprocs=1
Edit 1
After restarting celery beat the problem remains.
Edit 2
Changed killasgroup=true to killasgroup=false and the problem remains.
The SIGKILL your worker received was initiated by another process. Your supervisord config looks fine, and the killasgroup would only affect a supervisor initiated kill (e.g. the ctl or a plugin) - and without that setting it would have sent the signal to the dispatcher anyway, not the child.
Most likely you have a memory leak and the OS's oomkiller is assassinating your process for bad behavior.
grep oom /var/log/messages. If you see messages, that's your problem.
If you don't find anything, try running the periodic process manually in a shell:
MyPeriodicTask().run()
And see what happens. I'd monitor system and process metrics with top in another terminal, if you don't have good instrumentation like Cacti, Ganglia, etc. for this host.
One sees this kind of error when an asynchronous task (through celery) or the script you are using stores a lot of data in memory and keeps growing.
In my case, I was getting data from another system and saving it in a variable, so I could export all the data (into a Django model / Excel file) after finishing the process.
Here is the catch: my script was gathering 10 million records, and memory usage kept climbing while it gathered them. This resulted in the exception above.
To overcome the issue, I divided the 10 million records into 20 parts (half a million each). Every time the gathered data reached 500,000 items, I stored it in my preferred local file / Django model and moved on to the next batch.
There is no need to use this exact number of partitions. The idea is to solve a complex problem by splitting it into multiple subproblems and solving them one by one. :D
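A minimal Python sketch of this batching idea, assuming a hypothetical fetch_records() generator and a Django model named Record (neither name is from the original answer):
BATCH_SIZE = 500000  # flush after every half-million items, as described above

def import_in_batches():
    batch = []
    for item in fetch_records():  # stream items instead of holding all 10 million
        batch.append(Record(value=item))
        if len(batch) >= BATCH_SIZE:
            Record.objects.bulk_create(batch)  # persist this chunk
            batch = []  # drop references so the memory can be freed
    if batch:
        Record.objects.bulk_create(batch)  # persist the final partial chunk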

How to gracefully restart django running fcgi behind nginx?

I'm running a Django instance behind nginx connected using fcgi (via the manage.py runfcgi command). Since the code is loaded into memory, I can't reload new code without killing and restarting the Django fcgi processes, thus interrupting the live website. The restart itself is very fast, but killing the fcgi processes first means some users' in-flight requests get interrupted, which is not good.
I'm wondering how I can reload new code without causing any interruption. Advice will be highly appreciated!
I would start a new fcgi process on a new port, change the nginx configuration to use the new port, have nginx reload configuration (which in itself is graceful), then eventually stop the old process (you can use netstat to find out when the last connection to the old port is closed).
Alternatively, you can change the fcgi implementation to fork a new process, close all sockets in the child except for the fcgi server socket, close the fcgi server socket in parent, exec a new django process in the child (making it use the fcgi server socket), and terminate the parent process once all fcgi connections are closed. IOW, implement graceful restart for runfcgi.
So I went ahead and implemented Martin's suggestion. Here is the bash script I came up with.
#!/bin/bash
pid_file=/path/to/pidfile
port_file=/path/to/port_file
old_pid=`cat $pid_file`

# Pick the next port, wrapping back to 8000 so we don't go up forever
if [[ -f $port_file ]]; then
    last_port=`cat $port_file`
    port_to_use=$(($last_port + 1))
else
    port_to_use=8000
fi
if [[ $port_to_use -gt 8999 ]]; then
    port_to_use=8000
fi

# Point nginx at the new port and start the new fcgi pool there
sed -i "s/$last_port/$port_to_use/g" /path/to/nginx.conf
python manage.py runfcgi host=127.0.0.1 port=$port_to_use maxchildren=5 maxspare=5 minspare=2 method=prefork pidfile=$pid_file
echo $port_to_use > $port_file

# Gracefully reload nginx, give it a moment, then kill the old fcgi pool
kill -HUP `cat /var/run/nginx.pid`
echo "Sleeping for 5 seconds"
sleep 5s
echo "Killing old processes on $last_port, pid $old_pid"
kill $old_pid
I came across this page while looking for a solution to this problem. Everything else failed, so I looked into the source code :)
The solution seems to be much simpler. The Django fcgi server uses flup, which handles the HUP signal the proper way: it shuts down gracefully. So all you have to do is:
send the HUP signal to the fcgi server (the pidfile= argument of runfcgi will come in handy)
wait a bit (flup allows child processes 10 seconds, so wait a couple more; 15 looks like a good number)
send the KILL signal to the fcgi server, just in case something blocked it
start the server again
That's it.
You can use spawning instead of FastCGI
http://www.eflorenzano.com/blog/post/spawning-django/
We finally found the proper solution to this!
http://rambleon.usebox.net/post/3279121000/how-to-gracefully-restart-django-running-fastcgi
First, send flup a HUP signal to trigger a restart. Flup will then do this to all of its children:
closes the socket, which will stop inactive children
sends an INT signal
waits 10 seconds
sends a KILL signal
When all the children are gone it will start new ones.
This works almost all of the time, except that if a child is handling a request when flup executes step 2, your server dies with a KeyboardInterrupt, giving the user a 500 error.
The solution is to install a SIGINT handler (see the page above for details). Even just ignoring SIGINT gives your process 10 seconds to exit, which is enough for most requests.
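A minimal sketch of the ignore-SIGINT variant, assuming you can run it early in the fcgi process (for example from the module that starts runfcgi; where exactly to hook it in depends on your setup):
import signal

# Ignore flup's INT during a graceful restart so an in-flight request is not
# aborted with KeyboardInterrupt; flup's follow-up KILL after ~10 seconds
# still guarantees the process eventually exits.
signal.signal(signal.SIGINT, signal.SIG_IGN)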