How to solve AWS ELB/EC2 HTTP 503 with timeout settings? - amazon-web-services

I'm getting intermittent but regular 503 errors ("Service Unavailable: Back-end server is at capacity") from a site consisting of 2 t2.medium instances behind an ELB. Neither instance is under particularly heavy load and all monitoring looks normal.
The AWS docs here:
http://docs.aws.amazon.com/ElasticLoadBalancing/latest/DeveloperGuide/ts-elb-error-message.html
say that a potential cause is mismatched timeout settings between the ELB and EC2s: "set the keep-alive timeout to greater than or equal to the idle timeout settings of your load balancer"
Apache conf on the EC2s has:
KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 5
Idle timeout on the load balancer is 60 seconds.
This would seem, then, to be a cause, but I'm unsure about the fix. Increasing the Apache KeepAliveTimeout isn't - I understand - normally recommended, and I'm equally unsure about the effect that reducing the idle timeout on the ELB will have on site performance.
What's the recommended approach? How can I get an idea of what the ideal settings are for my setup and the sort of traffic level (currently about 30-50 requests/min) it deals with?

I would lower the idle timeout in the ELB. Clients will need to open new connections more often, but that is only slightly slower than reusing a keep-alive connection.
Raising the keep-alive in Apache to 60 (or higher) would also fix the 503s, but you need to be careful not to run out of connections or memory, especially with the prefork MPM, because many more worker slots will sit in the keep-alive state. Use the worker MPM (or the event MPM if you're not scared of the "This MPM is experimental" warning), and make sure MaxClients is high enough to handle all requests but low enough that you don't run out of memory.
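For concreteness, here is a hedged sketch of both options; the load balancer name and the exact numbers are placeholders, not recommendations. To lower the idle timeout on a classic ELB below Apache's 5-second KeepAliveTimeout, via the AWS CLI:

aws elb modify-load-balancer-attributes --load-balancer-name my-elb --load-balancer-attributes "{\"ConnectionSettings\":{\"IdleTimeout\":4}}"

Or, going the other way, raise Apache's keep-alive above the ELB's 60-second idle timeout and size the MPM to absorb the extra keep-alive slots:

KeepAlive On
KeepAliveTimeout 65
MaxKeepAliveRequests 100
# worker/event MPM: high enough for peak traffic, low enough to fit in memory
MaxClients 400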

Related

Concurrent requests to Nginx Server

I am having a problem with my server dealing with a large volume of concurrent users signing on and operating at the same time. Our business case requires the user base to log in at exactly the same time (a 1-minute window) and perform various operations in our application. Once the server goes past about 1000 concurrent users, the website starts loading very slowly and constantly gives 502 errors.
We have reviewed the server metrics (CPU, RAM, data traffic utilization) in the cloud and most resources are operating at below 10%. Scaling up the server doesn't help.
While the website is constantly returning 502 errors or responding after a long time, direct database queries and SSH connections work fine. We have therefore concluded that the issue comes down to the number of concurrent requests the server can handle, due to some Nginx or Gunicorn configuration we may have set up incorrectly.
Please advise on any possibly incorrect configuration (or any other solution) to this issue:
Server info:
Server specs - AWS c4.8xlarge (36 vCPU, 60 GiB RAM)
Web server - Nginx
Backend server - Gunicorn
(The nginx conf and Gunicorn conf were attached as images.)
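There is no way to tell from the attached images alone, but as a purely hypothetical sketch, these are the settings that most often cap an Nginx/Gunicorn stack at roughly 1,000 concurrent connections even when CPU and RAM are idle; every value below is illustrative, and "myproject.wsgi" is a placeholder module name:

# nginx.conf - per-worker connection ceiling (distro defaults of 512-768 are a common bottleneck)
worker_processes auto;
events {
    worker_connections 4096;
}

# gunicorn - sync workers serve one request at a time, and the default 30 s worker
# timeout kills long requests, which nginx then reports as 502
gunicorn myproject.wsgi --workers 17 --worker-class gevent --worker-connections 1000 --backlog 2048 --timeout 120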

Intermittent 502 gateway errors with AWS ALB in front of ECS services running express / nginx

Background:
We are running a single-page application served via nginx, with a Node.js (v12.10) backend running Express. It runs as containers on ECS; we currently have three t3a.medium container instances, with the API and web services each running 6 replicas across them. We use an ALB to handle load balancing / routing of requests. We run three subnets across 3 AZs, with the load balancer associated with all three and the instances spread across the 3 AZs as well.
Problem:
We are trying to get to the root cause of some intermittent 502 errors that are appearing for both front and back end. I have downloaded the ALB access logs and the interesting thing about all of these requests is that they all show the following.
- request_processing_time: 0.000
- target_processing_time: 0.000 (sometimes this will be 0.001 or at most 0.004)
- response_processing_time: -1
At the time of these errors I can see that there were healthy targets available.
Now I know that some people have had issues like this when keep-alive times were shorter on the server side than on the ALB side, so connections were being forcibly closed that the ALB then tried to reuse (which is in line with the AWS troubleshooting guidelines). However, looking at the keep-alive times for our back end, they are currently set to double the ALB's. Also, the requests themselves can be replayed via Chrome dev tools and they succeed (I'm not sure whether that is a valid way to check for a malformed request, but it seemed reasonable).
I am very new to this area and if anyone has some suggestions as to where to look or what sort of tests to run that might help me pinpoint this issue it would be greatly appreciated. I have run some load tests on certain endpoints and duplicated the 502 errors, however the errors under heavy load differ from the intermittent ones I have seen on our logs in that the target_processing_time is quite high so to my mind this is another issue altogether. At this stage I would like to understand the errors that show a target_processing_time of basically zero to start with.
I wrote a blog post about this a bit over a year ago that's probably worth taking a look at (the issue is caused by a behavior change in Node.js 8+):
https://adamcrowder.net/posts/node-express-api-and-aws-alb-502/
TL;DR is you need to set the nodejs http.Server keepAliveTimeout (which is in ms) to be higher than the load balancer's idle timeout (which is in seconds).
Please also note that there is something called HTTP keep-alive, which sets an HTTP header and has absolutely nothing to do with this problem. Make sure you're setting the right thing.
Also note that there is currently a regression in Node.js where setting keepAliveTimeout may not work properly. That bug is being tracked here: https://github.com/nodejs/node/issues/27363 and is worth looking through if you're still having this problem (you may need to set headersTimeout as well).
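A minimal sketch of that fix, assuming an Express app behind an ALB using the default 60-second idle timeout (the route, port, and exact values are illustrative):

const express = require('express');
const app = express();

app.get('/health', (req, res) => res.sendStatus(200));

const server = app.listen(3000);

// Keep idle sockets open longer than the ALB's 60 s idle timeout (note: milliseconds),
// so the ALB never reuses a connection that Node has already closed.
server.keepAliveTimeout = 65 * 1000;

// Guard against the regression linked above: headersTimeout should exceed
// keepAliveTimeout, or keep-alive sockets may still be dropped early.
server.headersTimeout = 66 * 1000;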

Django on Apache - Prevent 504 Gateway Timeout

I have a Django server running on Apache via mod_wsgi. I have a massive background task, triggered via an API call, that searches emails (it generally takes a few hours to run).
To facilitate debugging - since exceptions and everything else happen in the background - I created an API call that runs the task blocking. So the browser actually blocks for those hours and then receives the results.
In localhost this is fine. However, in the real Apache environment, after about 30 minutes I get a 504 Gateway Timeout error.
How do I change the settings so that Apache allows - just during this debug phase - the HTTP request to block for a few hours without returning a 504 Gateway Timeout?
I'm assuming this can be changed in the Apache configuration.
You should not be doing long running tasks within Apache processes, nor even waiting for them. Use a background task queueing system such as Celery to run them. Have any web request return as soon as it is queued and implement some sort of polling mechanism as necessary to see if the job is complete and results can be obtained.
Also, are you sure the 504 isn't coming from some front-end proxy (explicit or transparent) or load balancer? There is no default timeout in Apache that is 30 minutes.

Why are uwsgi workers idle while nginx shows a lot of timeouts?

Stack: nginx, uwsgi, django
uwsgitop and top both showed the uwsgi workers as idle, while the nginx error log said "upstream timed out".
I thought some requests needed a lot of resources (e.g., waiting on the DB or cache) while others didn't, but after checking the timed-out requests, most of them were not expensive; every kind of request had timed out.
So why didn't nginx send requests to the idle workers if the others were really busy? And why did the uwsgi master keep some workers busy and the others idle?
I'd like to answer my own question.
change the kernel parameter: net.ipv4.ip_conntrack_max from 65560 to 6556000
I have a full story on how we found the answer:
Users said: slow, slow, slow.
nginx was flooded with "upstream connection timed out".
I checked the uwsgi log, found some errors and fixed them; found more, fixed more, and this loop lasted for days. Until yesterday I thought the problem had nothing to do with uwsgi, memcached, the DB, redis, or anything in the backend, because uwsgi was idle.
So I thought something must be wrong with nginx: reload, restart, check connections, workers, proxy_read_timeout, etc. No luck.
I checked ulimit -n, which reported 1024, the default. I have 8 nginx workers, so connections could reach 1024 * 8; I thought that should be OK since nginx never complained about too many open files. Anyway, I changed it to 4096. No luck.
I checked the number of connections and their state, and the problem appeared: the upstream connections were almost all stuck in SYN_SENT before timing out, and only 2 or 3 of 300 connections were in ESTABLISHED state. We wanted to know why, and one of my friends told me to use tcpdump, the magic tool I had never dared to try.
Then we went to syslog, found the error there pointing at conntrack, and finally resolved the problem.
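For reference, a hedged sketch of how that kernel change is typically made persistent; the file name is arbitrary, and on newer kernels the parameter is exposed as net.netfilter.nf_conntrack_max rather than net.ipv4.ip_conntrack_max:

# /etc/sysctl.d/99-conntrack.conf
net.ipv4.ip_conntrack_max = 6556000

# apply without a reboot (run as root)
sysctl -p /etc/sysctl.d/99-conntrack.conf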
I had a similar issue where my listen queue was growing and the RPS was low despite all the workers idling.
samuel found one of the cases but there are a few more potential causes for this behavior:
net.core.somaxconn not high enough
uwsgi --listen not high enough
nginx worker processes too low
nginx worker connections too low
If none of these work, check your logs to confirm that inbound requests to uWSGI are HTTP/1.1 and not HTTP/1.0, and then use --http11-socket.
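For concreteness, a hedged sketch of where each of those knobs lives; the values and the uwsgi module/socket names are illustrative, not tuned recommendations:

# kernel: raise the accept-queue ceiling
sysctl -w net.core.somaxconn=4096

# uwsgi: the listen queue must fit under somaxconn; use --http11-socket
# instead of --socket if nginx proxies plain HTTP/1.1 to uwsgi
uwsgi --socket /tmp/app.sock --listen 4096 --processes 8 --module myapp.wsgi

# nginx: enough workers and per-worker connections
worker_processes auto;
events {
    worker_connections 4096;
}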
Here are some of my findings when I wrestled with this issue: https://wontonst.blogspot.com/2019/06/squishing-performance-bug-in.html
The nginx tuning page also has some other configurations that may or may not be useful in solving this issue:
https://www.nginx.com/blog/tuning-nginx/

Normal CPU Usage But Slow Jetty Response Time During Peak Hour

We have several web servers running Jetty to serve 100 requests per second.
During peak hours, Jetty's response time becomes slow and the number of requests it handles drops.
We have checked that
- The CPU usage of Jetty is around 20-30%, which is healthy.
- IO figures are normal.
- There are no slow queries in our DB, and the DB server is healthy too.
- The network is healthy.
Adding more web servers solves the problem.
But I don't understand why the CPU usage doesn't rise when the web traffic load is heavy.
Does anyone have similar experience?