Nginx proxy on Elastic Beanstalk EC2 instance times out

I've got a Docker application up and running on Elastic Beanstalk, and have noticed that occasionally it will time out and return 504 Gateway Timeout to requests from the Internet. In order to debug this, I created a script that I'm running on my local machine that constantly sends the exact same request to my application and tells me if it received anything but a 200 response. This is an extract of the output:
[21:38:14] Success
[21:38:15] Success
[21:38:15] Success
[21:39:15] Non 200 code: 504
[21:40:15] Non 200 code: 504
[21:41:15] Non 200 code: 504
[21:42:15] Non 200 code: 504
[21:43:15] Non 200 code: 504
[21:44:15] Success
[21:44:15] Success
[21:44:16] Success
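A minimal sketch of such a probe in Python, assuming a placeholder URL and a one-second interval (not the asker's actual script):

# Hypothetical probe: repeatedly request the same endpoint and report
# anything other than a 200 response. URL and interval are placeholders.
import time
import requests

URL = "http://my-env.elasticbeanstalk.com/v1/jobs"  # placeholder

while True:
    stamp = time.strftime("%H:%M:%S")
    try:
        # timeout is set longer than the ELB's 60-second idle timeout
        resp = requests.get(URL, timeout=70)
        if resp.status_code == 200:
            print(f"[{stamp}] Success")
        else:
            print(f"[{stamp}] Non 200 code: {resp.status_code}")
    except requests.RequestException as exc:
        print(f"[{stamp}] Request failed: {exc}")
    time.sleep(1)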
I was wondering if the 504 requests even hit my EC2 instance, so I checked its nginx logs:
172.31.8.21 - [01/Mar/2015:21:38:15 +0000] "GET /v1/jobs HTTP/1.1" 200 5
172.31.8.21 - [01/Mar/2015:21:38:15 +0000] "GET /v1/jobs HTTP/1.1" 200 5
172.31.8.21 - [01/Mar/2015:21:38:15 +0000] "GET /v1/jobs HTTP/1.1" 200 5
172.31.23.11 - [01/Mar/2015:21:39:15 +0000] "GET /v1/jobs HTTP/1.1" 400 0
172.31.23.11 - [01/Mar/2015:21:40:15 +0000] "GET /v1/jobs HTTP/1.1" 400 0
172.31.23.11 - [01/Mar/2015:21:41:15 +0000] "GET /v1/jobs HTTP/1.1" 400 0
172.31.23.11 - [01/Mar/2015:21:42:15 +0000] "GET /v1/jobs HTTP/1.1" 400 0
172.31.23.11 - [01/Mar/2015:21:43:15 +0000] "GET /v1/jobs HTTP/1.1" 400 0
Looking at the logs, it seems that the load balancer sends a request to my EC2 instance, the request times out after 60 seconds, the load balancer returns 504, and nginx logs the request as a 400. What's really strange is that the source IP of the failing requests (172.31.23.11) is different from the IP on every successful request (172.31.8.21).
As I understand it, this is the topology of how a request is handled in a Docker Elastic Beanstalk application:
request -> load balancer -> load balancer proxy -> EC2 instance -> EC2 instance proxy -> Docker image
So I checked the nginx logs inside my Docker image, and there are no access.log entries for the time period when the load balancer was returning 504s.
That means that for some reason, requests are getting to the EC2 instance proxy and then timing out. Has anybody experienced this problem and knows why? I'd also like to know why the timeouts only happen when the 172.31.23.11 IP address is the one doing the requesting.
It's also worth noting that nginx on my EC2 instance reports nothing in its error.log when the timeouts occur.
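One setting worth ruling out on the AWS side is the load balancer's idle timeout, which defaults to 60 seconds on a Classic ELB and matches the 60-second gap above. A minimal boto3 sketch for inspecting (and, if needed, raising) it; the load balancer name and region are placeholders:

# Sketch: inspect and raise the Classic ELB idle timeout with boto3.
# "my-eb-load-balancer" is a placeholder for the environment's ELB name.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

attrs = elb.describe_load_balancer_attributes(LoadBalancerName="my-eb-load-balancer")
print(attrs["LoadBalancerAttributes"]["ConnectionSettings"]["IdleTimeout"])  # 60 by default

elb.modify_load_balancer_attributes(
    LoadBalancerName="my-eb-load-balancer",
    LoadBalancerAttributes={"ConnectionSettings": {"IdleTimeout": 300}},  # e.g. 5 minutes
)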

Related

Nginx 10 minute timeout occurring

I have an AWS-hosted web application that initiates a long-running server process (more than 10 minutes). An Nginx reverse proxy sits between the Application Load Balancer (ALB) and the service. The Nginx server and the service run in separate Kubernetes pods on an EC2 instance.
I'm experiencing an issue with the connection being closed. The Nginx logs show an HTTP 499 (the client closed the connection before the response was sent):
[05/Dec/2022:12:02:27 +0000] "POST -------------- HTTP/1.1" 499 0 "https://------------.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36"
The issue is repeatable and occurs exactly 10 minutes after the request was initiated. Despite having set the ALB, Nginx, and SQLAlchemy timeouts to be much longer than 10 minutes, I suspect some timeout with a default value of 10 minutes is firing, but I can't figure out where.
Nginx is the component I'm least familiar with, so I suspect I have failed to set the necessary timeouts in its conf file. I have set this:
proxy_read_timeout 20m;
Can anyone suggest where in the system the default timeout is occurring?
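One piece that can be verified directly is the ALB idle timeout actually in effect, to confirm the longer value took hold. A small boto3 sketch to read it back; the load balancer ARN is a placeholder:

# Sketch: confirm the ALB idle timeout currently in effect.
# The LoadBalancerArn below is a placeholder, not the real ARN.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

attrs = elbv2.describe_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:...:loadbalancer/app/my-alb/..."
)
for attr in attrs["Attributes"]:
    if attr["Key"] == "idle_timeout.timeout_seconds":
        print("ALB idle timeout (seconds):", attr["Value"])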

DDoS crashes my site that uses Gunicorn in a docker container, nginx throws connection refused errors, yet Gunicorn is still running?

I am running a Django site with Gunicorn inside a docker container.
Requests are forwarded to this container by nginx, which is running non-dockerized as a regular Ubuntu service.
My site sometimes comes under heavy DDoS attacks that cannot be prevented. I have implemented a number of measures, including Cloudflare, nginx rate limits, gunicorn's own rate limits, and fail2ban. Ultimately, these attacks manage to get through due to the sheer number of IP addresses that appear to be in the botnet.
I'm not running anything super-critical, and I will look into load balancing and other options later. However, my main issue is not just that the DDoS attacks take down my site - it's that the site doesn't come back up once the attack is over.
Somehow, the sheer number of requests is breaking something, and I cannot figure it out. The only way to bring the site back is to restart the container. Nginx service is running just fine, and shows the following in the error logs every time:
2022/08/02 18:03:07 [error] 2115246#2115246: *72 connect() failed (111: Connection refused) while connecting to upstream, client: 172.104.109.161, server: examplesite.com, request: "GET / HTTP/2.0", upstream: "http://127.0.0.1:8000/", host: "examplesite.com"
From this, I thought that somehow the DDoS was crashing the docker container with gunicorn and the django app. Hence, I implemented a health check in the Dockerfile:
HEALTHCHECK --interval=60s --timeout=5s --start-period=5s --retries=3 \
    CMD curl -I --fail http://localhost:8000/ || exit 1
I used Docker Autoheal to monitor the health of the container; however, the container never turns "unhealthy". Manually running curl http://localhost:8000/ returns the website's home page, which is why the container never turns unhealthy.
Despite this, the container does not appear to be accepting any more requests from nginx, as this is the only output from gunicorn (indicating that it receives the healthcheck curl, but nothing else):
172.17.0.1 - - [02/Aug/2022:15:34:49 +0000] "GET / HTTP/1.0" 403 135 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3599.0 Safari/537.36"
[2022-08-02 15:34:49 +0000] [1344] [INFO] Autorestarting worker after current request.
[2022-08-02 15:34:49 +0000] [1344] [INFO] Worker exiting (pid: 1344)
172.17.0.1 - - [02/Aug/2022:15:34:49 +0000] "GET / HTTP/1.0" 403 135 "-" "Mozilla/5.0 (iPad; CPU OS 8_1_1 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12B435 Safari/600.1.4"
[2022-08-02 15:34:50 +0000] [1447] [INFO] Booting worker with pid: 1447
[2022-08-02 15:34:50 +0000] [1448] [INFO] Booting worker with pid: 1448
[2022-08-02 15:34:51 +0000] [1449] [INFO] Booting worker with pid: 1449
127.0.0.1 - - [02/Aug/2022:15:35:31 +0000] "HEAD / HTTP/1.1" 200 87301 "-" "curl/7.74.0"
127.0.0.1 - - [02/Aug/2022:15:36:31 +0000] "HEAD / HTTP/1.1" 200 87301 "-" "curl/7.74.0"
127.0.0.1 - - [02/Aug/2022:15:37:31 +0000] "HEAD / HTTP/1.1" 200 87301 "-" "curl/7.74.0"
127.0.0.1 - - [02/Aug/2022:15:51:33 +0000] "HEAD / HTTP/1.1" 200 87301 "-" "curl/7.74.0"
[2022-08-02 15:51:54 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:1449)
[2022-08-02 15:51:54 +0000] [1449] [INFO] Worker exiting (pid: 1449)
127.0.0.1 - - [02/Aug/2022:15:52:33 +0000] "HEAD / HTTP/1.1" 200 87301 "-" "curl/7.74.0"
127.0.0.1 - - [02/Aug/2022:15:53:34 +0000] "HEAD / HTTP/1.1" 200 87301 "-" "curl/7.74.0"
As you can see, no more non-curl requests are received by gunicorn after 15:34:49, and nginx continues to log the upstream connection refused error. What can I do about this? Manually restarting the Docker container is simply not feasible. The health check ought to catch this, but the site still works internally even though the container is no longer receiving the outside requests from nginx.
I've tried varying the gunicorn workers and the number of requests per worker, but nothing works. The site works perfectly fine normally; I am just completely stuck on where the DDoS is breaking something. From what I can observe, nginx is functioning fine and the issue is somewhere in the dockerised gunicorn instance, but I don't see how, given that it responds to internal curl commands perfectly well - if it were broken, the health check wouldn't be able to reach the site!
Edit, extract of my nginx config:
server {
    listen 443 ssl http2;
    server_name examplesite.com www.examplesite.com;

    ssl_certificate /etc/ssl/cert.pem;
    ssl_certificate_key /etc/ssl/key.pem;
    ssl_client_certificate /etc/ssl/cloudflare.pem;
    ssl_verify_client on;

    client_body_timeout 5s;
    client_header_timeout 5s;

    location / {
        limit_conn limitzone 15;
        limit_req zone=one burst=10 nodelay;

        proxy_pass http://127.0.0.1:8000/;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_redirect http://127.0.0.1:8000/ https://examplesite.com;
    }
}
Update: there is no unreasonable use of system resources by the container; I'm still unsure where in the pipeline things are breaking.
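Since the container keeps answering its own curl but stops accepting connections from the host, one stopgap is a host-side watchdog that probes the published port the same way nginx does and restarts the container after repeated connection failures. A rough sketch using the Docker SDK for Python; the container name, URL, and thresholds are assumptions, not values from the question:

# Hypothetical watchdog: probe the port nginx proxies to and restart the
# container when connections start being refused. All names are placeholders.
import time
import docker
import requests

CONTAINER_NAME = "django-gunicorn"      # placeholder
PROBE_URL = "http://127.0.0.1:8000/"    # same upstream that nginx uses
MAX_FAILURES = 3

client = docker.from_env()
failures = 0

while True:
    try:
        # Any HTTP response at all means the port is accepting connections;
        # the failure mode being chased here is "connection refused".
        requests.get(PROBE_URL, timeout=5)
        failures = 0
    except requests.RequestException:
        failures += 1
        if failures >= MAX_FAILURES:
            client.containers.get(CONTAINER_NAME).restart()
            failures = 0
    time.sleep(30)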

Block User-Agent on AWS Elastic Beanstalk

I am running a Django application on AWS Elastic Beanstalk. I keep having alerts (for days now) because the following user agent constantly tries to access some pages: "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:55 +0000] "HEAD /pma/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:55 +0000] "HEAD /db/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:56 +0000] "HEAD /admin/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:56 +0000] "HEAD /mysql/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:56 +0000] "HEAD /database/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:57 +0000] "HEAD /db/phpmyadmin/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:57 +0000] "HEAD /db/phpMyAdmin/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:57 +0000] "HEAD /sqlmanager/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:58 +0000] "HEAD /mysqlmanager/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:58 +0000] "HEAD /php-myadmin/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:58 +0000] "HEAD /phpmy-admin/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:59 +0000] "HEAD /mysqladmin/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:59 +0000] "HEAD /mysql-admin/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
[30/Jul/2017:13:55:59 +0000] "HEAD /admin/phpmyadmin/ HTTP/1.1" 403 - "-" "Mozilla/5.0 Jorgee"
It used to get 404 responses, but I managed to block it with a 403 thanks to the following line in settings.py:
DISALLOWED_USER_AGENTS = (re.compile(r'Mozilla\/5.0 Jorgee'), )
Is there a way to simply block it before it even reaches the Django level? Or a way to stop it from being written to the logs? It generates Health Check alerts :-/
You could create an AWS Web Application Firewall with a rule to reject traffic using that user agent string. Then attach the WAF to the Elastic Load Balancer in your Elastic Beanstalk environment.
Alternatively, you could create a rule in the reverse-proxy running on your Elastic Beanstalk EC2 instances to block that traffic before it gets to Django. I'm not sure if Django apps on EB default to using Apache or Nginx for the reverse proxy. You'd have to figure out which one you are using and then look up how to configure that to block traffic based on a user-agent string.
It's not clear to me how this traffic is causing health check alerts in your application. If it is spamming your app with so much traffic that your server becomes overloaded and unresponsive, then I would recommend using a WAF to block it so that your server(s) will never have to see the traffic at all.
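As a concrete illustration of the WAF approach (this answer predates WAFv2), here is a hedged boto3 sketch that creates a regional web ACL blocking that User-Agent and attaches it to an Application Load Balancer. All names, the region, and the ARN are placeholders, and this is only one possible way to wire it up:

# Sketch: block a User-Agent with AWS WAFv2 and attach the web ACL to the
# environment's Application Load Balancer. Names and ARNs are placeholders.
import boto3

wafv2 = boto3.client("wafv2", region_name="us-east-1")

acl = wafv2.create_web_acl(
    Name="block-jorgee",
    Scope="REGIONAL",                # REGIONAL scope is what attaches to an ALB
    DefaultAction={"Allow": {}},     # allow everything no rule blocks
    Rules=[
        {
            "Name": "block-jorgee-user-agent",
            "Priority": 0,
            "Statement": {
                "ByteMatchStatement": {
                    "SearchString": b"Jorgee",
                    "FieldToMatch": {"SingleHeader": {"Name": "user-agent"}},
                    "TextTransformations": [{"Priority": 0, "Type": "NONE"}],
                    "PositionalConstraint": "CONTAINS",
                }
            },
            "Action": {"Block": {}},
            "VisibilityConfig": {
                "SampledRequestsEnabled": True,
                "CloudWatchMetricsEnabled": True,
                "MetricName": "BlockJorgee",
            },
        }
    ],
    VisibilityConfig={
        "SampledRequestsEnabled": True,
        "CloudWatchMetricsEnabled": True,
        "MetricName": "BlockJorgeeACL",
    },
)

# Associate the web ACL with the load balancer in front of the EB environment.
wafv2.associate_web_acl(
    WebACLArn=acl["Summary"]["ARN"],
    ResourceArn="arn:aws:elasticloadbalancing:...:loadbalancer/app/my-env-alb/...",
)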
You could use AWS WAF rules to block anything you want from there. That said, I assume you created the environment with a Classic Load Balancer. Migrating to an Application Load Balancer is really not painful, and it does far more than the Classic Load Balancer, with room to grow later. If you consider moving to the ALB (we did this a long time ago, so every new environment gets one), there's an easy way to block these requests, or at least push them aside so they don't interfere.
For us, the easiest approach was to define an AWS WAF rule with a string condition that matches the Host header against the domains we actually publish our sites under. Any request that doesn't use one of those domains gets a 403 and no longer generates alerts like "requests to the ELB are failing with 5xx" on Elastic Beanstalk. If you look closely at the ELB-level logs, these bots tend to hit the direct IP address rather than a DNS name, which is what produces the errors when nginx only knows the server_name values you defined, for instance.
Thanks to this, we have cleaner logs and there are no more health warnings on the EB environment, which used to be the norm.
I described this case in detail at https://www.mkubaczyk.com/2017/10/10/use-aws-waf-block-bots-jorgee-500-status-elastic-beanstalk if you need some guidance or screenshots.
Try ModSecurity, which lets you control the traffic reaching your Apache server:
https://github.com/SpiderLabs/ModSecurity/wiki/Reference-Manual
https://github.com/SpiderLabs/owasp-modsecurity-crs/tree/v3.0/master/rules
Add the user agent to be blocked to this file on your server:
https://github.com/SpiderLabs/owasp-modsecurity-crs/blob/v3.0/master/rules/crawlers-user-agents.data

Production Django Application Throwing/Not Throwing 500 Error based on Debug = Value

I have a production Django application that runs fine with DEBUG = True but not with DEBUG = False.
If I load the domain, it shows my urls.py file, which is really bad.
I want to get my application to the point where it uses DEBUG=False and TEMPLATE_DEBUG=False instead of DEBUG=True and TEMPLATE_DEBUG=True, since leaving them set to True exposes the application.
If I view my error.log under nginx with DEBUG=True:
2013/10/25 11:35:34 [error] 2263#0: *5 connect() failed (111: Connection refused) while connecting to upstream, client: xx.xxx.xx.xxx, server: *.myapp.com, request: "GET / HTTP/1.1", upstream: "http://127.0.0.1:8001/", host: "www.myapp.com"
If I view my access.log under nginx with DEBUG=True:
xx.xxx.xx.xxx - - [25/Oct/2013:11:35:33 +0000] "GET / HTTP/1.1" 502 173 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:24.0) Gecko/20100101 Firefox/24.0"
So my question is: why does the application load successfully when I set DEBUG=True and TEMPLATE_DEBUG=True, but show my custom HTTP 500 error page (which I created to handle HTTP 500 errors) when I set DEBUG=False and TEMPLATE_DEBUG=False?
Thanks to Toad013 and Dmitry for their suggestions.
It appears the issue might have been with how nginx and gunicorn were being started rather than a configuration issue, so I ended up using the following to start my app:
/usr/local/bin/gunicorn -c /home/ubuntu/virtualenv/gunicorn_config.py myapp.wsgi
sudo nginx -c /etc/nginx/nginx.conf
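For context, the gunicorn_config.py passed with -c is an ordinary Python module read at startup. A minimal illustration of what such a file can contain; every value below is an assumption, not the poster's actual configuration:

# Illustrative gunicorn config module; all values are placeholders.
bind = "127.0.0.1:8001"        # matches the upstream address nginx proxies to
workers = 3                    # a common rule of thumb is 2 * CPU cores + 1
timeout = 30                   # seconds before a hung worker is killed and restarted
accesslog = "/var/log/gunicorn/access.log"
errorlog = "/var/log/gunicorn/error.log"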

Nginx Bad Gateway with Django Social Auth and uwsgi

My site runs correctly locally (using the built-in runserver), but when running behind nginx and uwsgi, I get a Bad Gateway (502) during the django-social-auth redirect.
The relevant nginx error_log:
IPREMOVED - - [11/Oct/2012:12:10:18 +1100] "GET /complete/google/? ..snip .. HTTP/1.1" 502 574 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_4) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.26 Safari/537.11"
The uwsgi log:
invalid request block size: 4204 (max 4096)...skip
Thu Oct 11 12:16:46 2012 - error parsing request
Refreshing the page after the Bad Gateway response redirects and logs me in correctly; this happens every single time. The nginx and uwsgi log entries above have different timestamps because they came from separate requests, but the pattern is consistent.
This is the first time deploying django to nginx for me, so I'm at a loss as to where to start.
Have you tried increasing the size of the uwsgi buffer? The "invalid request block size: 4204 (max 4096)" message means the incoming request (its headers plus the long query string on the social-auth callback) exceeds uwsgi's default 4096-byte buffer, so uwsgi drops the request and nginx reports a 502. Raise the buffer, for example:
-b 32768
http://comments.gmane.org/gmane.comp.python.wsgi.uwsgi.general/1171