uWSGI listen queue of socket full - django

My setup is a load balancer (HAProxy) in front of two nginx servers running Django. Server 2 works fine, but sometimes server 1 will start crashing and its log fills with the message
*** uWSGI listen queue of socket ":8000" (fd: 3) full !!! (101/100) ***
How do I go about resolving this issue?

Your listen queue is full. When you run uwsgi, pass it --listen 1024 to increase the queue to 1024.
Note that a larger queue makes you more susceptible to a DDoS attack.
You may also need to increase the kernel limit net.core.somaxconn:
sysctl -w net.core.somaxconn=65536
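If you start uWSGI from an ini file rather than the command line, the equivalent option looks like this (a minimal sketch; the value and file name are whatever fits your deployment):
[uwsgi]
# backlog for the uwsgi socket; it cannot usefully exceed net.core.somaxconn
listen = 1024
To make the sysctl change survive a reboot, also put net.core.somaxconn = 65536 in /etc/sysctl.conf or in a file under /etc/sysctl.d/.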

Related

uWSGI Segmentation Fault Prevents Web Server Running

I am currently running a web server through two containers:
NGINX Container: Serves HTTPS requests and redirects HTTP to HTTPS. Requests are passed via uWSGI to the Django app.
Django Container: Runs the necessary Django code.
When running docker-compose up --build, everything compiles correctly until uWSGI raises a Segmentation Fault.
....
django3_1 | Python main interpreter initialized at 0x7fd7bce0d190
django3_1 | python threads support enabled
django3_1 | your server socket listen backlog is limited to 100 connections
django3_1 | your mercy for graceful operations on workers is 60 seconds
django3_1 | mapped 145840 bytes (142 KB) for 1 cores
django3_1 | *** Operational MODE: single process ***
django3_1 | !!! uWSGI process 7 got Segmentation Fault !!!
test_django3_1 exited with code 1
I would appreciate any advice, as I'm not able to see into the container for debugging while it is starting up, so I don't know where this segmentation fault is occurring.
The SSL certificates have been correctly set up.
The django3 container was running on a python:3.9-alpine image, which installs Python 3.9.2 in the container. There appears to be an issue between uWSGI and the Python dependencies on this version. Rolling the container back to python:3.8-alpine resolved the dependency version mismatch.
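Concretely, the fix amounts to pinning the base image in the Django container's Dockerfile; a minimal sketch, assuming an otherwise standard Dockerfile:
# Dockerfile for the django3 container
# pin the base image back to Python 3.8 until the uWSGI build issue is resolved
FROM python:3.8-alpine
# (remaining build steps unchanged)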

Django Channels: gets stuck after a period of time

I run code from https://github.com/andrewgodwin/channels-examples/tree/master/multichat for around 50 users.
It gets stuck without any notice. The server is not down and the access log shows nothing special. When I stop the Daphne server (with Ctrl+C), it takes about 5-10 minutes to go down completely. Sometimes I have to run a kill command.
The weird thing is that when I put Daphne under supervisord and restart it every 30 minutes using crontab, websockets can connect normally. It's hacky but it works.
My config: HAProxy => Daphne
daphne -b 192.168.0.6 -p 8000 yyapp.asgi:application --access-log=/home/admin/daphne.log
backend daphne
balance source
option http-server-close
option forceclose
timeout check 1000ms
reqrep ^([^\ ]*)\ /ws/(.*) \1\ /\2
server daphne 192.168.0.6:8000 check maxconn 10000 inter 5s
Debian: 9.4 (original kernel) on OVH server.
Python: 3.6.4
Daphne: 2.2.1
Channels: 2.1.2
Django: 1.11.15
Redis: 4.0.11
I know this question may be too general, but I really have no ideas about this one. I tried upgrading Python and re-installing all the packages, but it didn't work.
Well, web servers and load balancers are, in general, very bad with persistent connections. You need to give HAProxy explicit instructions so it knows when and how to time out unused tunnels.
There are four timeouts that Haproxy will need to keep track of:
timeout client
timeout connect
timeout server
timeout tunnel
The first three are related to the initial HTTP negotiation phase of the socket connection. As soon as the connection is established, only timeout tunnel matters. You will need to tinker with the values for your own application, but some suggested values to start with are:
timeout client: 25s
timeout connect: 5s
timeout server: 25s
timeout tunnel: 3600s
In your code, that would be:
backend daphne
balance source
option http-server-close
option forceclose
timeout check 1000ms
timeout client 25s
timeout connect 5s
timeout server 25s
timeout tunnel 3600s
reqrep ^([^\ ]*)\ /ws/(.*) \1\ /\2
server daphne 192.168.0.6:8000 check maxconn 10000 inter 5s
You might need to tinker with the other timeouts to get a good mixture. Some timeouts that may affect your setup - and some starting values - are:
timeout http-keep-alive: 1s
timeout http-request: 15s
timeout queue: 30s
timeout tarpit: 60s
Of course, read up and customize to suit your needs.
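If it helps to see where these would live, here is a minimal sketch of an HAProxy defaults section using the starting values above (purely illustrative; move values into specific frontends or backends as your setup requires):
defaults
mode http
timeout client 25s
timeout connect 5s
timeout server 25s
timeout tunnel 3600s
timeout http-keep-alive 1s
timeout http-request 15s
timeout queue 30s
timeout tarpit 60s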
Reference:
Haproxy - Websockets Load Balancing

uwsgi listen queue fills on reload

I'm running a Django app on uwsgi with an average of 110 concurrent users and 5 requests per second during peak hours. I'm finding that when I deploy with uwsgi reload during these peak hours, workers keep getting slowly killed and restarted, and then the uwsgi logs start showing the error:
Gracefully killing worker 1 (pid: 25145)...
Gracefully killing worker 2 (pid: 25147)...
... a few minutes go by ...
worker 2 killed successfully (pid: 25147)
Respawned uWSGI worker 2 (new pid: 727)
... a few minutes go by ...
worker 2 killed successfully (pid: 727)
Respawned uWSGI worker 2 (new pid: 896)
... this continues gradually for 25 minutes until:
*** listen queue of socket "127.0.0.1:8001" (fd: 3) full !!! (101/100) ***
At this point my app rapidly slows to a crawl and I can only recover with a hard uwsgi stop followed by a uwsgi start. There are some relevant details which make this situation kind of peculiar:
This only occurs when I uwsgi reload, otherwise the listen queue never fills up on its own
The error messages and slowdown only start to occur about 25 minutes after the reload
Even during the moment of crisis, memory and CPU resources on the machine seem fine
If I deploy during lighter traffic times, this issue does not seem to pop up
I realize that I can increase the listen queue size, but that seems like a band-aid more than an actual solution. And the fact that it only fills up during reload (and takes 25 minutes to do so) leads me to believe that it will fill up eventually regardless of the size. I would like to figure out the mechanism that is causing the queue to fill up and address that at the source.
Relevant uwsgi config:
[uwsgi]
socket = 127.0.0.1:8001
processes = 4
threads = 2
max-requests = 300
reload-on-rss = 800
vacuum = True
touch-reload = foo/uwsgi/reload.txt
memory-report = true
Relevant software version numbers:
uwsgi 2.0.14
Ubuntu 14.04.1
Django 1.11.13
Python 2.7.6
It appears that our touch reload is not graceful under even a modest amount of traffic. Is this to be expected, or do we have a more fundamental issue?
uWSGI has a harakiri mode that will kill workers whose requests run longer than a configured timeout, to prevent unreliable code from hanging (and effectively taking down the app). I would suggest looking there for why your processes are being killed.
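A minimal sketch of enabling it in the ini file from the question (the 120-second value is only an illustration; pick something longer than your slowest legitimate request):
# add to the [uwsgi] section shown in the question
# kill any worker whose single request runs longer than 120 seconds
harakiri = 120
# log extra detail when a worker is killed this way
harakiri-verbose = true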
As to why a hard stop works and a graceful stop does not -- it seems to further indicate your application code is hanging. A graceful stop will send SIGHUP, which allows resources to be cleaned up in the application. SIGINT and SIGTERM follow the harsher guidelines of "stop what you are doing right now and exit".
Anyway, it boils down to this not being a uwsgi issue, but an issue in your application code. Find what is hanging and why. Since you are not noticing CPU spikes, some probable places to look are:
blocking connections
locks
a long sleep
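If you want to see exactly where a worker is stuck, uWSGI's py-tracebacker can dump live Python tracebacks from running workers (it needs thread support, which your threads = 2 setting already provides); a minimal sketch, with an example socket path:
# in your uwsgi ini: creates one traceback socket per worker (path + worker number)
py-tracebacker = /tmp/tbsocket
# then, while a worker appears hung, read worker 1's traceback:
uwsgi --connect-and-read /tmp/tbsocket1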
Good luck!
The key thing to look at is the message: listen queue of socket "127.0.0.1:8001" (fd: 3) full !!! (101/100).
The default listen queue size is 100. Increase the queue size by adding the listen option to uwsgi.ini:
listen = 4096

504 gateway timeout flask socketio

I am working on a flask-socketio server which is getting stuck in a state where only 504s (gateway timeout) are returned. We are using AWS ELB in front of the server. I was wondering if anyone wouldn't mind giving some tips as to how to debug this issue.
Other symptoms:
This problem does not occur consistently, but once it begins happening, only 504s are received from requests. Restarting the process seems to fix the issue.
When I run netstat -nt on the server, I see many entries with Recv-Q values of over 100 stuck in the CLOSE_WAIT state
When I run strace on the process, I only see select and clock_gettime
When I run tcpdump on the server, I can see the valid requests coming into the server
AWS health checks are coming back successfully
EDIT:
I should also add two things:
Flask-SocketIO's built-in server is used in production (not gunicorn or uWSGI)
Python's daemonize function is used for daemonizing the app
Switching to gunicorn as the WSGI server seemed to fix the problem. This might legitimately be an issue with Flask-SocketIO's built-in server.
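For reference, the Flask-SocketIO documentation recommends running under gunicorn with a single eventlet (or gevent) worker; a minimal sketch, assuming the Flask-SocketIO app object is called app in a module named module (adjust both to your project):
gunicorn --worker-class eventlet -w 1 module:app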

uWSGI downtime when restart

I have a problem with uwsgi every time I restart the server after a code update.
When I restart uwsgi using "sudo restart accounting", there is a small gap between stopping and starting the instance that results in downtime and kills all in-flight requests.
When I try "sudo reload accounting", it works, but my memory usage doubles. When I run "ps aux | grep accounting", it shows that I have 10 running processes (accounting.ini) instead of 5, and the server freezes when memory hits the limit.
accounting.ini
I am running
Ubuntu 14.04
Django 1.9
nginx 1.4.6
uwsgi 2.0.12
This is how uwsgi does a graceful reload: it keeps the old processes until their requests are served and creates new ones that will take over incoming requests.
Read "Things that could go wrong" in the uWSGI docs:
Do not forget, your workers/threads that are still running requests
could block the reload (for various reasons) for more seconds than
your proxy server could tolerate.
And this
Another important step of graceful reload is to avoid destroying
workers/threads that are still managing requests. Obviously requests
could be stuck, so you should have a timeout for running workers (in
uWSGI it is called the “worker’s mercy” and it has a default value of
60 seconds).
So I would recommend trying worker-reload-mercy.
The default is to wait 60 seconds; lower it to something that your server can handle.
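A minimal sketch of setting it in your accounting.ini (the 30-second value is only an illustration):
# seconds to wait for a busy worker during reload before forcibly killing it
worker-reload-mercy = 30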
Tell me if it worked.
uWSGI chain reload
This is another attempt at fixing your issue. As mentioned, your uwsgi workers are restarting in the manner described below:
Send a SIGHUP signal to the master.
Wait for running workers.
Close all of the file descriptors except the ones mapped to sockets.
Call exec() on itself.
One of the cons of this kind of reload might be stuck workers.
Additionally, you report that your server crashes when uwsgi maintains 10 processes (5 old and 5 new ones).
I propose trying a chain reload. A direct quote from the documentation explains this kind of reload best:
When triggered, it will restart one worker at a time, and the following worker is not reloaded until the previous one is ready to accept new requests.
It means that you will not have 10 processes on your server but only 5.
Config that should work:
# your .ini file
lazy-apps = true
touch-chain-reload = /path/to/reloadFile
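After a deploy, you trigger the chained restart by touching the configured file, for example:
touch /path/to/reloadFile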
Some resources on chain reloading and the other reload types are linked below:
Chain reloading uwsgi docs
uWSGI graceful Python code deploy