I am working on a flask-socketio server that gets stuck in a state where it returns only 504s (gateway timeout). We are using AWS ELB in front of the server. I would appreciate any tips on how to debug this issue.
Other symptoms:
This problem does not occur consistently, but once it begins happening, only 504s are received from requests. Restarting the process seems to fix the issue.
When I run netstat -nt on the server, I see many entries with Recv-Q values of over 100 stuck in the CLOSE_WAIT state
When I run strace on the process, I only see select and clock_gettime
When I run tcpdump on the server, I can see the valid requests coming into the server
AWS health checks are coming back successfully
EDIT:
I should also add two things:
flask-socketio's server is used for production (not gunicorn or uWSGI)
Python's daemonize function is used for daemonizing the app
It seemed that switching to gunicorn as the wsgi server fixed the problem. This legitimately might be an issue with the flask-socketio wsgi server.
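For anyone chasing the same symptom: a CLOSE_WAIT socket with a non-empty Recv-Q means the peer has already closed the connection while the application has neither read the pending data nor closed its end, which is consistent with a server that has stopped servicing its sockets. A minimal sketch for counting such entries from saved netstat output follows. The function name and the sample rows are made up for illustration, and the column layout assumes Linux `netstat -nt`.

```python
def stuck_close_wait(netstat_output, min_recv_q=1):
    """Return (local, remote, recv_q) for CLOSE_WAIT sockets with unread data.

    Assumes Linux `netstat -nt` rows of the form:
    Proto Recv-Q Send-Q Local-Address Foreign-Address State
    """
    stuck = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP row.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "CLOSE_WAIT":
            continue
        recv_q = int(fields[1])
        if recv_q >= min_recv_q:
            stuck.append((fields[3], fields[4], recv_q))
    return stuck


# Illustrative sample only, not output from the server in the question.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp      129      0 10.0.0.5:5000           10.0.1.7:41234          CLOSE_WAIT
tcp        0      0 10.0.0.5:22             10.0.2.9:50211          ESTABLISHED
tcp      201      0 10.0.0.5:5000           10.0.1.8:41290          CLOSE_WAIT
"""

for local, remote, recv_q in stuck_close_wait(SAMPLE, min_recv_q=100):
    print(local, remote, recv_q)
```

If the count keeps growing while the process only sits in select, the accept/read loop is probably blocked, which would fit the observation that a different WSGI server made the problem go away.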
Related
My Django application runs well for a while, then I get 502 Bad Gateway. After a few hours, I am unable to ping the domain or connect to my server (on Amazon Lightsail) over SSH. My other application served by nginx is also unavailable at that point. If I don't start the Django application, the other application served by nginx runs steadily. So I suspect that my Django application is what brings down nginx and the server.
After rebooting the server several times, the server seems to recover, and I can ping the domain and connect over SSH again. But after a while the same problem occurs again. I wonder how to fix it.
Some diagnostic information, covering the period from the start of the Django application to the failure of the nginx server, is provided below.
The RAM usage is high during the process.
The uwsgi log. https://bpa.st/FP7Q
The Nginx error log. https://bpa.st/35EQ
I'm trying to run a Django server and all of a sudden I'm not able to go to localhost:8000. I was able to a few seconds ago, but now it's just freezing up and saying "waiting for localhost"
I'm on Mac OS X
How do I debug this?
Some links:
Waiting for localhost : getting this message on all browsers
Waiting for localhost, forever!
Why does my machine keeps waiting for localhost forever?
To summarise: in general it means that 1) the server is waiting for input (e.g. not returning a response), 2) some other service might be running on the same port, or 3) there is no DB connection.
That said, a restart should sort all of these out by killing any processes that might have taken the port, restarting the DB, and reconnecting properly.
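To tell these cases apart before restarting everything, it can help to send one request with a timeout and see whether the connection is refused (nothing is listening), accepted but never answered (the server is stuck), or answered. This is only a rough sketch using the standard library. The function name `probe_http` and the timeout values are my own choices.

```python
import socket
import urllib.error
import urllib.request


def probe_http(url, timeout=3.0):
    """Classify how a server reacts to a single GET request."""
    # Bypass any configured proxy so the local server is tested directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return "responded with HTTP %d" % resp.status
    except urllib.error.HTTPError as exc:
        # 4xx/5xx still means the server is alive and answering.
        return "responded with HTTP %d" % exc.code
    except urllib.error.URLError as exc:
        if isinstance(exc.reason, ConnectionRefusedError):
            return "refused"
        if isinstance(exc.reason, socket.timeout):
            return "timed out"
        return "failed: %s" % exc.reason
    except socket.timeout:
        # Connected, but no response arrived within the timeout.
        return "timed out"


if __name__ == "__main__":
    print(probe_http("http://localhost:8000/"))
```

"refused" points at the server not running at all, while "timed out" points at a process that holds the port but is not responding.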
I have a Django project running under gunicorn bound to a Unix socket (not a port).
I am using supervisor to run it. The problem is that supervisor says the process is running, and the logs don't show anything.
But the site says "Bad gateway". Nginx generally shows Bad Gateway when gunicorn is not running. Here, though, gunicorn is running without errors and nginx still shows Bad Gateway.
If it used a port, I would have tested it locally with "wget http://localhost:8000", but since we use a socket here, how can I test whether it is really running, and why is it not showing any error?
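One way to do the equivalent of the wget test against a socket file is to open the Unix socket directly and send a plain HTTP request. Below is a minimal sketch with the standard library. The helper names and the socket path in the usage line are hypothetical, so substitute the path from your gunicorn bind setting.

```python
import socket


def build_request(path="/", host="localhost"):
    """Build a minimal HTTP/1.0 GET request."""
    return ("GET %s HTTP/1.0\r\nHost: %s\r\n\r\n" % (path, host)).encode("ascii")


def probe_unix_socket(sock_path, path="/", timeout=5.0):
    """Send one GET over a Unix socket and return the HTTP status line.

    Raises OSError (e.g. FileNotFoundError, ConnectionRefusedError,
    PermissionError) when the socket cannot be reached.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as client:
        client.settimeout(timeout)
        client.connect(sock_path)
        client.sendall(build_request(path))
        data = b""
        # Read until the first line of the response is complete.
        while b"\r\n" not in data:
            chunk = client.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.split(b"\r\n", 1)[0].decode("latin-1")


if __name__ == "__main__":
    # Hypothetical path; use the one gunicorn is actually bound to.
    print(probe_unix_socket("/run/gunicorn/app.sock"))
```

If this prints a status line when run as the gunicorn user but fails with a permission error when run as the nginx user, the Bad Gateway is likely a socket permission problem rather than gunicorn being down.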
I have a Django site that uses Gunicorn and Nginx. Occasionally, I'll have a problem that I need to debug. In the past, I would shut down Gunicorn and Nginx, go to my Django project directory and start the Django development server ("python ./manage.py runserver 0:8000"), and then restart Nginx. I could then insert set_trace() commands and do my debugging. When I fixed the problem I'd shut down Nginx and then restart Gunicorn and Nginx. I'm pretty sure this was working.
Recently, though, I've begun having problems. What happens now is that when I've stopped at a breakpoint, after a couple of minutes the web page that I've stopped on will change and display "404 Not Found" and if I take another step in the debugger, I'll see this error:
- Broken pipe from ('127.0.0.1', 43742)
This happens on my development, staging, and production servers which I'm accessing via their domain names, e.g. "web01.example.com" (not really example).
What is the correct way to debug my Django application on my remote servers?
Thanks.
I figured out the problem. First I observed that when I stopped at a breakpoint, the page always timed out after exactly one minute which suggested that the Nginx connection to the web server was timing out if the web server took more than 60 seconds to respond. I then found an Nginx proxy_read_timeout directive which defines this timeout. Then it was merely a matter of changing the length of the timeout in my Nginx config file:
# /etc/nginx/sites-enabled/example.conf
http {
    server {
        ...
        location @django {
            ...
            # Set timeout to 1 hour
            proxy_read_timeout 3600s;
            ...
        }
        ...
    }
}
Once you've made this change you need to reload Nginx, not restart it, for the change to take effect. Then you start Django as I indicated above and you can now debug your Django application without it timing out. Just be sure to remove the timeout setting when you're done debugging, reload Nginx again, and restart Gunicorn.
I am trying to get Django-Celery running on my Django App. I cannot get the worker server to run. When I try I get the message: No Connection could be made because the target machine actively refused it
Here is what I have done so far. First, I installed the django celery package: http://pypi.python.org/pypi/django-celery
I can load it into python without problems. I also installed the RabbitMQ server per the windows install instructions: http://www.rabbitmq.com/install.html#windows
Starting the Python tutorials on the RabbitMQ site, I saw the need to install pika: http://pypi.python.org/pypi/pika. It imports without any problems.
From there I start the RabbitMQ server by running this at the command line: rabbitmq-service start
I get the message back that Service RabbitMQ started
Here is where I start to have problems.
I attempted the first steps in django-celery: http://packages.python.org/django-celery/getting-started/first-steps-with-django.html and the "hello world" example on the rabbitMQ site: http://www.rabbitmq.com/tutorials/tutorial-one-python.html
In both cases I get the message: No Connection could be made because the target machine actively refused it
My first thought was that this sounded like a firewall problem. So I went into the Windows 7 firewall and added inbound and outbound rules to open the local and remote ports 5672 and 5673 for TCP, but I still get the same error message.
When I run rabbitmqctl status I get the message:
Error: unable to connect to node 'rabbit@hostname': nodedown
diagnostics:
- nodes and their ports on hostname: [{rabbitmqctl18856, 505031}]
Does that mean that it is trying to operate on those ports? What about the default 5672?
Any suggestions?
UPDATE: This was actually a problem resulting from several failed RabbitMQ installs conflicting with the latest installation. If you have to remove RabbitMQ, use the 'rabbitmq-service remove' command and not SC DELETE, which caused a lot of problems for me; I had to go in and clean up my Windows registry.
The nodedown error indicated by rabbitmqctl suggests that the server isn't running on that machine.
Try going through the steps in RabbitMQ's troubleshooting guide. In particular, pay close attention to the logs. Has the server crashed for some reason? Could you post the logs somewhere?
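As a quick check that is independent of pika and Celery, you can try a plain TCP connection to the broker port. "Actively refused" is what the OS reports when nothing is listening on that port, whereas a firewall that drops packets usually shows up as a timeout instead. This is just a sketch with the standard library. The function name is made up, and 5672 is the default port mentioned in the question.

```python
import socket


def probe_tcp(host="localhost", port=5672, timeout=3.0):
    """Report whether anything accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "listening"
    except ConnectionRefusedError:
        # The machine answered, but no process is bound to the port.
        return "refused"
    except socket.timeout:
        # No answer at all, which often means packets are being dropped.
        return "timed out"
    except OSError as exc:
        return "error: %s" % exc


if __name__ == "__main__":
    print(probe_tcp("localhost", 5672))
```

A result of "refused" here agrees with the nodedown message: the broker process is not up, so firewall rules would not change anything.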