DDoS crashes my site that uses Gunicorn in a docker container, nginx throws connection refused errors, yet Gunicorn is still running? - django

I am running a Django site with Gunicorn inside a docker container.
Requests are forwarded to this container by nginx, which is running non-dockerized as a regular Ubuntu service.
My site sometimes comes under heavy DDoS attacks that I have not been able to prevent. I have implemented a number of measures, including Cloudflare, nginx rate limits, Gunicorn's own limits, and fail2ban. Ultimately, the attacks still get through due to the sheer number of IP addresses that appear to be in the botnet.
I'm not running anything super-critical, and I will look into load balancing and other options later. My main issue is not just that the DDoS attacks take down my site - it's that the site does not come back up once the attack is over.
Somehow the sheer number of requests is breaking something, and I cannot figure out what. The only way to bring the site back is to restart the container. The nginx service keeps running just fine and shows the following in its error log every time:
2022/08/02 18:03:07 [error] 2115246#2115246: *72 connect() failed (111: Connection refused) while connecting to upstream, client: 172.104.109.161, server: examplesite.com, request: "GET / HTTP/2.0", upstream: "http://127.0.0.1:8000/", host: "examplesite.com"
From this, I thought that the DDoS was somehow crashing the Docker container running Gunicorn and the Django app, so I implemented a health check in the Dockerfile:
HEALTHCHECK --interval=60s --timeout=5s --start-period=5s --retries=3 \
CMD curl -I --fail http://localhost:8000/ || exit 1
I used Docker Autoheal to monitor the health of the container, but the container never turns "unhealthy". Manually running curl http://localhost:8000/ returns the website's home page, which is why the container never turns unhealthy.
Despite this, the container does not appear to accept any more requests from nginx, as this is the only output from Gunicorn (it receives the health-check curl, but nothing else):
172.17.0.1 - - [02/Aug/2022:15:34:49 +0000] "GET / HTTP/1.0" 403 135 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3599.0 Safari/537.36"
[2022-08-02 15:34:49 +0000] [1344] [INFO] Autorestarting worker after current request.
[2022-08-02 15:34:49 +0000] [1344] [INFO] Worker exiting (pid: 1344)
172.17.0.1 - - [02/Aug/2022:15:34:49 +0000] "GET / HTTP/1.0" 403 135 "-" "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
[2022-08-02 15:34:50 +0000] [1447] [INFO] Booting worker with pid: 1447
[2022-08-02 15:34:50 +0000] [1448] [INFO] Booting worker with pid: 1448
[2022-08-02 15:34:51 +0000] [1449] [INFO] Booting worker with pid: 1449
127.0.0.1 - - [02/Aug/2022:15:35:31 +0000] "HEAD / HTTP/1.1" 200 87301 "-" "curl/7.74.0"
127.0.0.1 - - [02/Aug/2022:15:36:31 +0000] "HEAD / HTTP/1.1" 200 87301 "-" "curl/7.74.0"
127.0.0.1 - - [02/Aug/2022:15:37:31 +0000] "HEAD / HTTP/1.1" 200 87301 "-" "curl/7.74.0"
127.0.0.1 - - [02/Aug/2022:15:51:33 +0000] "HEAD / HTTP/1.1" 200 87301 "-" "curl/7.74.0"
[2022-08-02 15:51:54 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:1449)
[2022-08-02 15:51:54 +0000] [1449] [INFO] Worker exiting (pid: 1449)
127.0.0.1 - - [02/Aug/2022:15:52:33 +0000] "HEAD / HTTP/1.1" 200 87301 "-" "curl/7.74.0"
127.0.0.1 - - [02/Aug/2022:15:53:34 +0000] "HEAD / HTTP/1.1" 200 87301 "-" "curl/7.74.0"
As you can see, no non-curl requests reach Gunicorn after 15:34:49, while nginx continues to log the upstream connection refused error. What can I do about this? Manually restarting the Docker container is simply not feasible. The health check ought to catch this, but for some reason the site still works internally while the container stops receiving the outside requests from nginx.
I've tried varying the number of Gunicorn workers and the requests per worker, but nothing helps. The site works perfectly fine normally; I am just completely stuck on where the DDoS is breaking something. From what I can observe, nginx is functioning fine and the issue is somewhere in the dockerised Gunicorn instance, but I don't see how, given that it responds to internal curl commands perfectly well - if it were broken, the health check wouldn't be able to reach the site at all!
Edit: an extract of my nginx config:
server {
    listen 443 ssl http2;
    server_name examplesite.com www.examplesite.com;
    ssl_certificate /etc/ssl/cert.pem;
    ssl_certificate_key /etc/ssl/key.pem;
    ssl_client_certificate /etc/ssl/cloudflare.pem;
    ssl_verify_client on;
    client_body_timeout 5s;
    client_header_timeout 5s;
    location / {
        limit_conn limitzone 15;
        limit_req zone=one burst=10 nodelay;
        proxy_pass http://127.0.0.1:8000/;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_redirect http://127.0.0.1:8000/ https://examplesite.com;
    }
}
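For reference, the limitzone and one zones used above are defined in the http block of the main nginx config, which is not part of this extract. They look roughly like the following, with placeholder sizes and rate rather than my exact values:
# in the http { } block - sizes and rate below are placeholders
limit_conn_zone $binary_remote_addr zone=limitzone:10m;
limit_req_zone  $binary_remote_addr zone=one:10m rate=10r/s;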
Update: the container shows no unreasonable use of system resources; I'm still not sure where in the pipeline it breaks.
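For the next time it happens, these are the kinds of checks that might narrow down where the refusals start (the container name below is a placeholder):
# is the container's port still published on the host?
docker port my_django_container
# is anything on the host still listening on 127.0.0.1:8000?
sudo ss -ltnp 'sport = :8000'
# hit the published port from the host - the same path nginx uses
curl -I http://127.0.0.1:8000/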

Related

websocket closes connection on google cloud platform

I have a WebSocket Java application installed on a Compute Engine VM on Google Cloud Platform. If the request goes through the Google load balancer, the WebSocket is closed automatically; if I access the VM directly by IP, it works fine. I have increased the backend service timeout to 86400 seconds, but it does not solve the issue. Any clues?
location /openWebSocket {
    proxy_pass http://127.0.0.1:8080;
    proxy_http_version 1.1;
    proxy_set_header Connection $connection_upgrade;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_read_timeout 7d;
    proxy_send_timeout 7d;
    keepalive_timeout 7d;
}
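(The $connection_upgrade variable is not built into nginx; it is defined elsewhere in the config by a map block in the http context, roughly like this:)
map $http_upgrade $connection_upgrade {
    default upgrade;
    ''      close;
}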
Below is the nginx access.log info
[21/Jan/2019:21:16:57 +0000] "GET /openWebSocket HTTP/1.1" 101 0 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
[21/Jan/2019:21:17:00 +0000] "GET / HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
[21/Jan/2019:21:17:00 +0000] "GET / HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
[21/Jan/2019:21:17:00 +0000] "GET / HTTP/1.1" 200 612 "-" "GoogleHC/1.0"
According to the documentation, the GCP load balancer supports WebSockets natively and can act as a proxy as long as the following conditions are met:
After a successful connection, the client issues a WebSocket Upgrade request.
The backend (in your case the Java app) issues a successful WebSocket Upgrade response.
I would check the GCP load balancer logs or the Java app logs for an error during the WebSocket handshake. If you find one, I would suggest checking how your code handles the WebSocket Upgrade response.
Hope it helps.
Increasing the timeout values on the backend usually resolves these types of issues.
You can also look into Session Affinity, which is still in beta at the moment.

Django auth views causing Http502 (with Gunicorn + Nginx)

I'm getting 502s when I try to access any of the views that rely on auth (so /admin/login/, posting to my own /login/ page, etc). It's not occurring on any other views/requests.
Here's the nginx access log:
GET /admin/login/ HTTP/1.1" 502 182 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0
Not much that's going to help me there. Here's an extract from the gunicorn log (the first line is a worker booting from when the last one died):
[2018-01-15 19:44:43 +0000] [4775] [INFO] Booting worker with pid: 4775
[2018-01-15 19:46:10 +0000] [4679] [CRITICAL] WORKER TIMEOUT (pid:4775)
[2018-01-15 19:46:10 +0000] [4775] [INFO] Worker exiting (pid: 4775)
What's causing me to lose workers and get 502s?
Edit: I'm using Django 2.0.1 and django-axes 4.0.1. I'm pretty sure this is an axes issue, but I don't know how to diagnose it.
Thanks to @kichik I enabled debug logging, and discovered that the views were throwing a "WSGIRequest has no attribute 'user'" exception because I was using middleware settings in the pre-Django 2 format. This answer solved the issue.
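For anyone hitting the same error: Django 2.0 no longer reads the old MIDDLEWARE_CLASSES setting, so if only that is defined, AuthenticationMiddleware never runs and request.user is never set. The fix is to declare the middleware under MIDDLEWARE instead; the list below is just the stock Django 2 default rather than my exact settings:
# settings.py - Django 2.0 only reads the new-style MIDDLEWARE setting
MIDDLEWARE = [
    'django.middleware.security.SecurityMiddleware',
    'django.contrib.sessions.middleware.SessionMiddleware',
    'django.middleware.common.CommonMiddleware',
    'django.middleware.csrf.CsrfViewMiddleware',
    'django.contrib.auth.middleware.AuthenticationMiddleware',
    'django.contrib.messages.middleware.MessageMiddleware',
    'django.middleware.clickjacking.XFrameOptionsMiddleware',
]
# Any old MIDDLEWARE_CLASSES setting can be removed; Django 2.0 ignores it.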

Block User-Agent on AWS Elastic Beanstalk

I am running a Django application on AWS Elastic Beanstalk. I keep having alerts (for days now) because the following user agent constantly tries to access some pages: "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:55 +0000] "HEAD /pma/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:55 +0000] "HEAD /db/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:56 +0000] "HEAD /admin/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:56 +0000] "HEAD /mysql/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:56 +0000] "HEAD /database/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:57 +0000] "HEAD /db/phpmyadmin/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:57 +0000] "HEAD /db/phpMyAdmin/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:57 +0000] "HEAD /sqlmanager/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:58 +0000] "HEAD /mysqlmanager/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:58 +0000] "HEAD /php-myadmin/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:58 +0000] "HEAD /phpmy-admin/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:59 +0000] "HEAD /mysqladmin/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:59 +0000] "HEAD /mysql-admin/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:59 +0000] "HEAD /admin/phpmyadmin/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
It used to return 404, but I managed to turn that into a 403 thanks to the following in settings.py:
import re  # needed for the compiled pattern below

DISALLOWED_USER_AGENTS = (re.compile(r'Mozilla\/5.0 Jorgee'), )
Is there a way to simply block it from even getting to the Django level? Or a way to stop getting it written to the logs? It generates Health Check alerts :-/
You could create an AWS Web Application Firewall with a rule to reject traffic using that user agent string. Then attach the WAF to the Elastic Load Balancer in your Elastic Beanstalk environment.
Alternatively, you could create a rule in the reverse-proxy running on your Elastic Beanstalk EC2 instances to block that traffic before it gets to Django. I'm not sure if Django apps on EB default to using Apache or Nginx for the reverse proxy. You'd have to figure out which one you are using and then look up how to configure that to block traffic based on a user-agent string.
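For example, if the proxy turns out to be nginx, a minimal sketch of such a rule inside the server block (using the "Jorgee" string from the logs above) would be:
# reject the Jorgee scanner before the request is proxied to Django
if ($http_user_agent ~* "Jorgee") {
    return 403;
}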
It's not clear to me how this traffic is causing health check alerts in your application. If it is spamming your app with so much traffic that your server becomes overloaded and unresponsive, then I would recommend using a WAF to block it so that your server(s) will never have to see the traffic at all.
You could use AWS WAF rules to block anything you want from there. I assume you created the environment with a Classic Load Balancer; migrating to an Application Load Balancer is really not painful, it does far more than the Classic Load Balancer, and it leaves room for upgrades later. If you consider moving to the ALB (we did this a long time ago, so every new environment of ours uses it), there is an easy way to block these requests, or at least push them aside so they don't cause trouble.
For us, the easiest approach was to define an AWS WAF rule with a string condition matching the Host header against the domains we actually publish our sites with. Any request that does not use one of those domains gets a 403 and no longer generates alerts like "requests to the ELB are failing with 5xx" on Elastic Beanstalk. If you look closely at the ELB-level logs, these bots tend to use direct IP addresses rather than DNS names, so they trigger errors when accessing a site whose server_name is defined in nginx, for instance.
Thanks to this we have cleaner logs, and there are no more health warnings on the EB environment, which used to be the norm.
I described this case in detail here https://www.mkubaczyk.com/2017/10/10/use-aws-waf-block-bots-jorgee-500-status-elastic-beanstalk if you need guidance or screenshots.
Try ModSecurity, which lets you control the traffic reaching your Apache server:
https://github.com/SpiderLabs/ModSecurity/wiki/Reference-Manual
https://github.com/SpiderLabs/owasp-modsecurity-crs/tree/v3.0/master/rules
Add the user agent you want to block to this file on your server:
https://github.com/SpiderLabs/owasp-modsecurity-crs/blob/v3.0/master/rules/crawlers-user-agents.data
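Assuming that file keeps its existing one-pattern-per-line, lower-case format, the new entry for this bot could simply be:
jorgee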

Show the Real IP in the Logs of a Keter Managed App

I would like to display the actual IP of the request instead of localhost in my log files. Since Keter manages the nginx config, I am not sure what I need to change to get the real IP.
This is what I see now:
127.0.0.1 - - [11/Jan/2014:09:25:08 +0000] "GET /favicon.ico HTTP/1.1" 200 - ""
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:27.0) Gecko/20100101 Firefox/27.0"
Keter hasn't been based on nginx for quite a while now. Recent versions of Keter set the X-Real-IP request header to the client's IP address (see issue #8), which you can use in wai-extra via IPAddrSource.

Production Django Application Throwing/Not Throwing 500 Error based on Debug = Value

I have a production Django application that runs fine with DEBUG = True but doesn't with DEBUG = False.
If I load the domain, it shows my urls.py file, which is really bad.
I want to run the application with DEBUG = False and TEMPLATE_DEBUG = False instead of DEBUG = True and TEMPLATE_DEBUG = True, since leaving them set to True exposes the application.
If I view my error.log under nginx with DEBUG=True:
2013/10/25 11:35:34 [error] 2263#0: *5 connect() failed (111: Connection refused) while connecting to upstream, client: xx.xxx.xx.xxx, server: *.myapp.com, request: "GET / HTTP/1.1", upstream: "http://127.0.0.1:8001/", host: "www.myapp.com"
If I view my access.log under nginx with DEBUG=True:
xx.xxx.xx.xxx - - [25/Oct/2013:11:35:33 +0000] "GET / HTTP/1.1" 502 173 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:24.0) Gecko/20100101 Firefox/24.0"
So my question is: why does the site load successfully when I set DEBUG=True and TEMPLATE_DEBUG=True, but show my custom HTTP 500 error page (which I created to handle HTTP 500 errors) when I set DEBUG=False and TEMPLATE_DEBUG=False?
Thanks to Toad013 and Dmitry for their suggestions.
It appears the issue might have been with how nginx and gunicorn were being started rather than a configuration issue, so I ended up using the following commands to start my app:
/usr/local/bin/gunicorn -c /home/ubuntu/virtualenv/gunicorn_config.py myapp.wsgi
sudo nginx -c /etc/nginx/nginx.conf