HTTP 504 errors returned by ELB even when hosts are healthy and able to serve request - amazon-web-services

I have a service which is deployed on Amazon Web Services (AWS), specifically 2 instances behind an Elastic Load Balancer (ELB). Availability zones are selected as all three us-west-2a,b,c
but only 2 of the above 3 zones have instances running in it.
The issues is that even though the traffic/load is not too high but I still get HTTP 504 errors from ELB often enough.
The log lines reads like this
-1 -1 -1 504 0 0 0
In order, --request_processing_time --backend_processing_time --response_processing_time --elb_status_code --backend_status_code --received_bytes --sent_bytes.
Description of what each field and response means can be found here
ELB idle timeout is 60 seconds. KeepAlive is 'On' on backend instances. Latency of requests from ELB are in check. I have tried increasing KeepAliveTimeout but to no avail.
Does anyone have any idea about how to proceed? I don't even know the root cause of this issue.
PS: More like a second question, there are a few cases (much less than 504 being returned by ELB when backend does not even accept the request) where even backend is returning a 504 and then ELB is forwarding the same to client. To the best of my knowledge HTTP 504 should be returned by a proxy only when backend is timing out. How can a server itself return a 504?

So that it might assist others in future, I am publishing my finding(s) here:
1) This 504 0 HTTP errors were mainly because of logrotate reloading apache instead of graceful restart.
The current AWS config does the following
/sbin/service httpd reload > /dev/null 2>/dev/null || true
so replace the service command with either apachectl -k graceful or /sbin/service httpd graceful
File location on my ec2 instance: /etc/logrotate.elasticbeanstalk.hourly/logrotate.elasticbeanstalk.httpd.conf
2) Because logrotate frequency was too high by default in AWS (once every hour), at least for my use case, and that in turn was reloading apache every hour, so I reduced that as well.

When backend connection timeout, ELB will put -1 to backend_processing_time column in its access log. Think what's happening is that some of your requests take longer than 60s for your backend to process. To confirm this, can you check your latency metrics? Please switch to maximum when viewing this metric. It will confirm my guess if you see latency frequently reaches 60s.
After it got confirmed, you might want to increase Idle timeout of your ELB and backend.

Related

EC2 Instance giving better output than ELB

We have cluster of 3 EC2 instances. Single EC2 instance is able to server aroung 500 user load on application. But when same EC2 instnace is put in ELB is not even serving for 250 users. We drilled more & put below configuration at different end.
Optimized code to respond in less time.
ELB is set with 300 sec timeout for all responses & healthy.unhealthy checks.
Apache on EC2 is set with 600 as timeout value & keep alive it set true.
ELB is routing request in equal distribution logic.
Every time we hit with higher load(500 on cluster) we are getting end up getting some failures with 504 bad gateway timeout error. Kindly help with solution to get more optimial output.

AWS/EKS: Getting frequent 504 gateway timeout errors from ALB

I'm using EKS to deploy a service, with ingress running on top of alb-ingress-controller.
All in all I have about 10 replicas of a single pod, with a single service of type NodePort which forwards traffic to them. The replicas run on 10 nodes, established with eksctl, and spread across 3 availability zones.
The problem I'm seeing is very strange - inside the cluster, all the logs show that requests are being handled in less than 1s, mostly around 20-50 millis. I know this because I used linkerd to show the percentiles of request latencies, as well as the app logs themselves. However, the ALB logs/monitoring tell a very different story. I see a relatively high request latency (often approaching 20s or more), and often also 504 errors returned from the ELB (sometimes 2-3 every 5 minutes).
When trying to read the access logs for the ALB, I noticed that the 504 lines look like this:
https 2019-12-10T14:56:54.514487Z app/1d061b91-XXXXX-3e15/19297d73543adb87 52.207.101.56:41266 192.168.32.189:30246 0.000 -1 -1 504 - 747 308 "GET XXXXXXXX" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-1:750977848747:targetgroup/1d061b91-358e2837024de757a3d/e59bbbdb58407de3 "Root=1-5defb1fa-cbcdd248dd043b5bf1221ad8" "XXXX" "XXXX" 1 2019-12-10T14:55:54.514000Z "forward" "-" "-" "192.168.32.189:30246" "-"
Where the request processing time is 0 and the target processing time is -1, indicating the request never made it to the backend, and response was returned immediately.
I tried to play with the backend HTTP keepalive timeout (currently at 75s) and with the ALB idle time (currently at 60s) but nothing seems to change much for this behavior.
If anyone can point me to how to proceed and investigate this, or what the cause can be, I'd appreciate it very much.
We faced a similar type of issue with EKS and ALB combination. If the target response code says -1, there may be a chance that the request waiting queue is full on the target side. So the ALB will immediately drop the request.
Try to do an ab benchmark by skipping the ALB and directly send the request to the service or the private IP address. Doing this will help you to identify where the problem is.
For us, 1 out of 10 requests failed if we send traffic via ALB. We are not seeing failures if we directly send the request to the service.
AWS recommendation is to use NLB over the ALB. NLB gives more advantages and suitable for Kubernetes. There is a blog which explains this Using a Network Load Balancer with the NGINX Ingress Controller on Amazon EKS
We changed to NLB and now we are not getting 5XX errors.

504 errors from Elastic Load Balancer using Tomcat

I have a application running on multiple EC2 instances and served by Apache Tomcat. I've set up an AWS Elastic Load Balancer in front of the application, and everything basically works as expected. However, I will occasionally get a random 504 timeout error from the ELB. This doesn't seem to be related to load, as I've seen the errors under light load and heavy load. Also, it doesn't seem to occur in any regular pattern or situation.
Earlier in my testing, I was getting 504 errors because my application was taking longer to respond than the default 60 second timeout on ELB. I resolved that by bumping up the ELB timeout to the level necessary for my app. However, the 504 errors I'm getting now are happening very quickly. So, for example, one error I saw was on a request with a response time of about a second. It seems odd to be getting a timeout error when the request can't possibly have timed out on the application server.
This may be a similar issue to this question, though I couldn't quite tell from the information presented. Also, I don't have an additional load balancer in the mix, just ELB straight to Tomcat.
So, after some more digging, I've found the issue. This page was helpful in solving the mystery by explaining some details about idle and keepalive timeouts:
There are two immediate causes for receiving a 504 from an ELB:
The application actually took longer than the ELB's connection timeout to respond. This is a slow timeout — the 504 will typically be
returned after a number of seconds, with the default for an ELB being
60 seconds. In this case, it is necessary either to increase the ELB's
connection timeout, or improve application performance.
The application did not respond to the ELB at all, instead closing its connection when data was requested. This is a fast timeout — the
504 will typically be returned in a matter of milliseconds, well under
the ELB's timeout setting.
The first scenario was what I had seen and resolved by raising the ELB timeout. The second scenario describes the confusing behavior I was seeing after raising the ELB timeout. My log files had the "-1 -1 -1" pattern like the example logs from the article:
2015-12-11T13:42:07.736195Z my-elb 10.0.0.1:59893 - -1 -1 -1 504 0 0 0 "GET http://my-elb/ HTTP/1.1" "curl/7.19.7" - -
From the conclusion:
In short, an ELB's connection timeout must be set lower than both the
application's idle and keepalive timeouts to prevent spurious 504s
from being generated.
At some point during development before I started using ELB, I set the Tomcat timeout such that it happened to be higher than the default ELB timeout. When I bumped up the ELB timeout, I made it higher than the connectionTimeout I had set in Tomcat. Raising connectionTimeout to be slightly higher than my new ELB timeout got rid of the mystery 504 errors. So, I've now gotten rid of both the "slow" and "fast" timeout errors.
Tomcat also has a keepAliveTimeout setting which defaults to be the same as connectionTimeout if not set. I didn't have it set, so modifying connectionTimeout was enough to resolve my issue.
The ELB is not likely to be the cause of a problem, but instead showing that you have one. The 504 error is Gateway Timeout which occurs when the server (in this case Tomcat) does not respond quickly enough.
(I have been using ELBs for extremely high load services for many years, and do not agree with the answer to the link to other SO answer. While it is technically true, and may be true with extremely high bursting rates like thousands of requests in a second, unless your volume is this high, I would look at your application, first.)
The most obvious test to confirm it's not the ELB is to test requests directly against one of the Tomcat servers in your cluster. If you cannot route to the Tomcat instances, you could try curl to localhost from the instance you want to test.
Note also that there is a Health Check setting for ELBs and these allow you to set certain rules defining whether the server is healthy -- if not, the ELB will remove it from the cluster until it is healthy again. Health can include timely response. Look at CloudWatch for the ELB to see if there have been unhealthy instances recently.
If you were seeing 504 in development, and now it's more frequent, I would guess that this is actually a load or performance issue. The most typical is that Java gets into some garbage collection thrashing issue due to a problem with the underlying application. Look at CloudWatch metrics for your EC2 instances to see if memory or CPU is high or spikey.

Why would AWS ELB (Elastic Load Balancer) sometimes returns 504 (gateway timeout) right away?

ELB occasionally returns 504 to our clients right away (under 1 seconds).
Problem is, it's totally random, when we repeat the request right away, it works as it should be.
Anyone have same issue or any idea on this?
Does this answers for your quiestion:
Troubleshooting Elastic Load Balancing: HTTP Errors
HTTP 504: Gateway Timeout
Description: Indicates that the load balancer closed a connection because a request did not complete within the idle timeout period.
Cause: The application takes longer to respond than the configured idle timeout.
Solution: Monitor the HTTPCode_ELB_5XX and Latency CloudWatch metrics. If there is an increase in these metrics, it could be due to the application not responding within the idle timeout period. For details about the requests that are timing out, enable access logs on the load balancer and review the 504 response codes in the logs that are generated by Elastic Load Balancing. If necessary, you can increase your back-end capacity or increase the configured idle timeout so that lengthy operations (such as uploading a large file) can complete.
Or this:
504 gateway timeout LB and EC2

408 timeouts on my Amazon ELB

We are seeing a lot of 408 timeouts on our ELB access logs. Have come across this thread https://serverfault.com/questions/485063/getting-408-errors-on-our-logs-with-no-request-or-user-agent
and also https://forums.aws.amazon.com/thread.jspa?messageID=307846
These are just two sample threads I found but others suggest the same solutions with no joy.
Have set web server timeout to be < ELB idle timeout, to be = to it and to be > than it, same result, our logs are polluted with these 408s. A bigger problem though it that they also throw off the average latency response time of our ELB which is what we trigger our auto scaler with.
We use Tomcat on our back end instances. No logs appear on tomcat to indicate a request was recieved but the ELB still shows as if requests had timed out.
On our ELB access logs there is no back end IP given for the 408s so in my opinon the requests never got to an instance at all but Amazon disagree :(.
Any one had this problem and got a reliable solution for it?
Following the suggestion of milsonspt in the linked thread, I added a virtual host to my server that monitors a different thread instead of 80 so all health checks will be executed on that host (replace CUSTOM_PORT with any port you want to be used for ELB health check).
Listen CUSTOM_PORT
<VirtualHost *:CUSTOM_PORT>
CustomLog "|/usr/sbin/rotatelogs /var/log/httpd/access_log_elb_health_check_rotated_%Y-%m-%d-%H_%M_%S 10M" combined
</VirtualHost>
Make sure that the ELB does NOT have a listener on that port.
That configuration removed the 408 errors and logs all health check in a separate log so you get an uncluttered log for your regular access log, and a dedicated log for health check.
This could happen when ELB is waiting on the client for complete request. If a partial request comes in, with incomplete headers AWS ELB would just wait. AWS ELB would not do anything with the partial request headers, and eventually respond with 408 req_timeout due to idle timeout expiry on the tcp connection.