Django Channels: gets stuck after a period of time - django

I run the code from https://github.com/andrewgodwin/channels-examples/tree/master/multichat for around 50 users.
It gets stuck without any warning. The server is not down and the access log shows nothing unusual. When I stop the Daphne server (with Ctrl+C), it takes about 5-10 minutes to shut down completely. Sometimes I have to run a kill command.
Strangely, when I put Daphne under supervisord and restart it every 30 minutes using crontab, WebSockets connect normally. It's hacky, but it works.
My config: HAProxy => Daphne
daphne -b 192.168.0.6 -p 8000 yyapp.asgi:application --access-log=/home/admin/daphne.log
backend daphne
balance source
option http-server-close
option forceclose
timeout check 1000ms
reqrep ^([^\ ]*)\ /ws/(.*) \1\ /\2
server daphne 192.168.0.6:8000 check maxconn 10000 inter 5s
Debian: 9.4 (original kernel) on OVH server.
Python: 3.6.4
Daphne: 2.2.1
Channels: 2.1.2
Django: 1.11.15
Redis: 4.0.11
I know this question may be too general, but I really have no idea what is going on here. I tried upgrading Python and reinstalling all the packages, but it didn't help.

Well, web servers and load balancers are, in general, very bad with persistent connections. You need to give HAProxy explicit instructions so it knows when and how to time out unused tunnels.
There are four timeouts that HAProxy needs to keep track of:
timeout client
timeout connect
timeout server
timeout tunnel
The first three are related to the initial HTTP negotiation phase of the socket connection. As soon as the connection is established, only timeout tunnel matters. You will need to tinker with the values for your own application, but some suggested values to start with are:
timeout client: 25s
timeout connect: 5s
timeout server: 25s
timeout tunnel: 3600s
In your code, that would be:
backend daphne
balance source
option http-server-close
option forceclose
timeout check 1000ms
timeout client 25s
timeout connect 5s
timeout server 25s
timeout tunnel 3600s
reqrep ^([^\ ]*)\ /ws/(.*) \1\ /\2
server daphne 192.168.0.6:8000 check maxconn 10000 inter 5s
You might need to tinker with the other timeouts to get a good mixture. Some timeouts that may affect your setup - and some starting values - are:
timeout http-keep-alive: 1s
timeout http-request: 15s
timeout queue: 30s
timeout tarpit: 60s
Of course, read up and customize to suit your needs.
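All four of these are valid in the defaults section, so - as a sketch, using the starting values above - they could be added like this:

defaults
timeout http-keep-alive 1s
timeout http-request 15s
timeout queue 30s
timeout tarpit 60s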
Reference:
Haproxy - Websockets Load Balancing

Related

django: Localhost stopped working suddenly?

I'm trying to run a Django server and all of a sudden I'm not able to go to localhost:8000. I was able to a few seconds ago, but now it's just freezing up and saying "waiting for localhost".
I'm on Mac OS X.
How do I debug this?
Some links:
Waiting for localhost : getting this message on all browsers
Waiting for localhost, forever!
Why does my machine keeps waiting for localhost forever?
To summarise: in general it means that 1) the server is waiting for input (e.g. not returning a response), 2) some other service might be running on the same port, or 3) there is no DB connection.
That said, a restart should sort all of these out by killing any processes that might have taken the port, restarting the DB, and reconnecting properly.
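A quick way to rule out the "something else is on the port" case - assuming the default runserver port 8000 - is to check what is already bound to it before restarting:

# list processes listening on port 8000 (macOS and Linux)
lsof -nP -iTCP:8000 -sTCP:LISTEN
# if a stale process shows up, stop it by its PID
kill <PID>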

uWSGI listen queue of socket full

My setup includes a load balancer (HAProxy) with two nginx servers running Django. Server 2 works fine, but sometimes server 1 starts crashing and the log fills up with
*** uWSGI listen queue of socket ":8000" (fd: 3) full !!! (101/100) ***
messages.
How do I go about resolving this issue?
Your listen queue is full. When you run uwsgi, pass it --listen 1024 to increase the queue to 1024.
Note that a larger queue makes you more susceptible to a DDoS attack.
You may also need to increase net.core.somaxconn:
sysctl -w net.core.somaxconn=65536
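For example - a sketch, where myproject.wsgi stands in for your own WSGI module - the option can be set on the command line or in the ini file:

# command line
uwsgi --socket :8000 --listen 1024 --module myproject.wsgi
# or in your uwsgi .ini file
listen = 1024

Keep in mind that the effective backlog is capped by the kernel's net.core.somaxconn, which is why the sysctl above may be needed as well.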

504 gateway timeout flask socketio

I am working on a flask-socketio server that is getting stuck in a state where only 504s (gateway timeouts) are returned. We are using AWS ELB in front of the server. I was wondering if anyone could give some tips on how to debug this issue.
Other symptoms:
This problem does not occur consistently, but once it begins happening, only 504s are received from requests. Restarting the process seems to fix the issue.
When I run netstat -nt on the server, I see many entries with Recv-Q values of over 100, stuck in the CLOSE_WAIT state
When I run strace on the process, I only see select and clock_gettime
When I run tcpdump on the server, I can see the valid requests coming into the server
AWS health checks are coming back successfully
EDIT:
I should also add two things:
flask-socketio's built-in server is used in production (not gunicorn or uWSGI)
Python's daemonize function is used to daemonize the app
Switching to gunicorn as the WSGI server seemed to fix the problem. This might legitimately be an issue with flask-socketio's built-in WSGI server.
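For reference, flask-socketio's documentation runs the app under gunicorn with a single eventlet (or gevent) worker - a sketch, assuming the application object is app in a module named app.py:

pip install gunicorn eventlet
gunicorn --worker-class eventlet -w 1 app:app

The single worker is deliberate: each worker keeps its connected clients in memory, so scaling beyond one process needs a message queue and sticky sessions.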

uWSGI downtime when restarting

I have a problem with uWSGI every time I restart the server after a code update.
When I restart uWSGI using "sudo restart accounting", there is a small gap between the instance stopping and starting that results in downtime and drops all current requests.
When I try "sudo reload accounting", it works, but my memory usage doubles. When I run "ps aux | grep accounting", it shows 10 running processes (accounting.ini) instead of 5, and the server freezes when the memory hits the limit.
accounting.ini
I am running
Ubuntu 14.04
Django 1.9
nginx 1.4.6
uwsgi 2.0.12
This is how uWSGI does a graceful reload: it keeps the old processes around until their requests are served and creates new ones that take over incoming requests.
Read "Things that could go wrong":
Do not forget, your workers/threads that are still running requests
could block the reload (for various reasons) for more seconds than
your proxy server could tolerate.
And this:
Another important step of graceful reload is to avoid destroying
workers/threads that are still managing requests. Obviously requests
could be stuck, so you should have a timeout for running workers (in
uWSGI it is called the “worker’s mercy” and it has a default value of
60 seconds).
So I would recommend trying worker-reload-mercy.
The default is to wait 60 seconds; lower it to something that your server can handle.
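A minimal sketch of the change in your ini file - the 30-second value is just a starting point, not something taken from your setup:

# in accounting.ini
worker-reload-mercy = 30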
Tell me if it worked.
uWSGI chain reload
This is another attempt at fixing your issue. As you mentioned, your uWSGI workers are restarting in the manner described below (a sketch of how this standard reload is triggered follows the list):
send SIGHUP signal to the master
Wait for running workers.
Close all of the file descriptors except the ones mapped to sockets.
Call exec() on itself.
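For reference - a sketch, assuming uWSGI was started with a pidfile such as /tmp/accounting.pid (adjust the path to your setup) - this standard reload is triggered like so:

# send SIGHUP to the uWSGI master
kill -HUP $(cat /tmp/accounting.pid)
# or let uwsgi read the pidfile for you
uwsgi --reload /tmp/accounting.pid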
One of the cons of this kind of reload might be stuck workers.
Additionally, you report that your server crashes when uWSGI maintains 10 processes (5 old and 5 new ones).
I propose trying a chain reload. A direct quote from the documentation explains this kind of reload best:
When triggered, it will restart one worker at time, and the following worker is not reloaded until the previous one is ready to accept new requests.
It means that you will not have 10 processes on your server but only 5.
Config that should work:
# your .ini file
lazy-apps = true
touch-chain-reload = /path/to/reloadFile
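After a deploy, the rolling restart is then triggered by touching the configured file (the path above is just a placeholder - point it somewhere your deploy script can write):

touch /path/to/reloadFile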
Some resources on chain reloading and other reload modes are linked below:
Chain reloading uwsgi docs
uWSGI graceful Python code deploy

nginx/gunicorn throwing 504s (timeouts) after some time - how to debug?

We have a nginx - gunicorn - django setup.
The server runs fine for a while and then nginx starts throwing
504 Gateway Time-outs.
Trying to access gunicorn locally (127.0.0.1:8000) with lynx does not work either.
Logging into the machine shows enough cpu, memory and disc space available:
CPU[|||| 3.3%]
Mem[|||||||||||||||||| 362/3750MB]
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 7.9G 5.5G 2.1G 73% /
Supervisor shows gunicorn is running:
gunicorn RUNNING pid 25264, uptime 2 days, 8:55:22
The database that sits underneath django is alive.
I looked through the django, supervisor, nginx and database logs and I couldn't find anything (!) suspicious.
[Update: Logs]
In the nginx error logs there are a few
client intended to send too large body
SSL_do_handshake() failed
errors every now and then, and definitely
upstream timed out
recv() failed (104: Connection reset by peer) while reading response header from upstream
errors after gunicorn got stuck.
Versions
nginx/1.6.2
gunicorn==18.0
Django==1.6.2
supervisor==3.0
Any recommendations on how to find out what is happening and why?
(restarting gunicorn via supervisor fixes the issue)
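One way to see what the stuck gunicorn workers are actually doing - a sketch, assuming py-spy (or at least strace) can be installed on the box - is to inspect a worker while the 504s are happening:

# find the gunicorn worker PIDs
ps aux | grep gunicorn
# dump the current Python stack of one worker (replace <pid>)
py-spy dump --pid <pid>
# or, at the syscall level, see whether it is blocked on I/O
strace -p <pid>

If every worker is blocked in the same place (a slow database query, an external API call, a lock), that is the resource exhausting gunicorn's worker pool and making nginx time out upstream.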