Why would AWS ELB (Elastic Load Balancer) sometimes return 504 (gateway timeout) right away?

ELB occasionally returns 504 to our clients right away (in under 1 second).
The problem is that it's totally random: when we repeat the request right away, it works as it should.
Does anyone have the same issue or any idea about this?

Does this answer your question?
Troubleshooting Elastic Load Balancing: HTTP Errors
HTTP 504: Gateway Timeout
Description: Indicates that the load balancer closed a connection because a request did not complete within the idle timeout period.
Cause: The application takes longer to respond than the configured idle timeout.
Solution: Monitor the HTTPCode_ELB_5XX and Latency CloudWatch metrics. If there is an increase in these metrics, it could be due to the application not responding within the idle timeout period. For details about the requests that are timing out, enable access logs on the load balancer and review the 504 response codes in the logs that are generated by Elastic Load Balancing. If necessary, you can increase your back-end capacity or increase the configured idle timeout so that lengthy operations (such as uploading a large file) can complete.
Or this:
504 gateway timeout LB and EC2

Related

Getting 5xx error with AWS Application Load Balancer - fluctuating healthy and unhealthy target group

My web application on AWS EC2 + load balancer sometimes shows 500 errors. How do I know if the error is on the server side or the application side?
I am using a Route 53 domain and SSL on my URL. I set the ALB to redirect requests on port 80 to 443 and to forward requests on port 443 to the target group (the EC2 instance). However, the target group sometimes returns a 5xx error code when handling a request. Please see the screenshots for the metrics and configurations of the ALB.
(Screenshots: Target Group Metrics, Target Group Configuration, Load Balancer Metrics, Load Balancer Listeners, EC2 Metrics)
Right now the web application is unstable; sometimes it returns a 502 or a 503 Service Unavailable (it looks like a connection timeout).
I have set the ALB idle timeout to 4000 seconds.
(Screenshot: ALB configuration)
The application is using Nuxt.js + PHP7.0 + MySQL + Apache 2.4.54.
I have set the Apache prefork MaxClients value to 1000, which should be enough to handle the requests to the application.
The EC2 instance is a t2.large, and the CPU and memory look sufficient for the processing.
It seems that if I request the IP address directly instead of the domain, the number of 5xx errors drops significantly (but they still occur).
I also have a WordPress application hosted on this EC2 instance under a subdomain (CNAME). I have never encountered any 5xx errors on this subdomain site, which makes me guess the errors are in my application code rather than on the server side.
Is the 5xx error from my application or from the server?
I also tried adding another EC2 instance to the target group so that at least one healthy instance is available to handle requests. However, the application uses a third-party API that has a strict IP whitelist policy, and from my research, an Elastic IP from AWS cannot be attached to two different EC2 instances.
First of all, if your application is prone to stutters, increase the health check retries and timeouts; that addresses the flapping target health in your original question.
From what I see in your screenshot, most of your 5xx errors come from either the server or the application (you obviously know better which is the culprit, since you have access to their logs).
To answer your question about 5xx errors coming from the LB: this happens right after the LB kicks out an unhealthy instance. If there is no healthy instance to replace it (which shouldn't be the case, because you're supposed to have an ASG if you enable evaluation of target health for the LB), the LB can't produce meaningful output and fails with a 5xx.
This should be enough information for you to make adjustments and investigate the logs.

How to monitor Fargate ECS web app timeouts in CloudWatch?

I have a simple setup: Fargate ECS cluster with ALB, running web API.
I want to monitor (and raise alarms on) the number of requests that time out for my web app. The only metric close to that which I found in CloudWatch is AWS/ApplicationELB -> TargetResponseTime.
But it seems that requests that time out from the ALB's point of view are not recorded there at all.
How do you monitor ALB timeouts?
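This answer only covers requests that time out at the ALB.
It is confusing because there is no specific metric whose name contains "timeout".
When the client does not send data before the idle timeout expires, the ALB generates an HTTP 408 error code and increments HTTPCode_ELB_4XX_Count. When the target does not respond in time, the ALB returns an HTTP 504 instead, which is counted in HTTPCode_ELB_5XX_Count.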
From the Docs
The load balancer sends the HTTP code to the client, saves the request to the access log, and increments the HTTPCode_ELB_4XX_Count or HTTPCode_ELB_5XX_Count metric.
In my view, you can set up a CloudWatch alarm to monitor the HTTPCode_ELB_4XX_Count metric and initiate an action (such as sending a notification to an email address) if the metric goes outside what you consider an acceptable range.
More details about the HTTPCode_ELB_4XX_Count -> https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-cloudwatch-metrics.html
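As a rough illustration of the alarm described above, here is a minimal sketch that builds the parameters for CloudWatch's `put_metric_alarm` call. The helper name `build_elb_alarm_params`, the alarm name, the threshold, the period, and the load balancer dimension value are all assumptions chosen for illustration; only the namespace and metric names come from the text above.

```python
def build_elb_alarm_params(load_balancer, metric_name="HTTPCode_ELB_4XX_Count",
                           threshold=10, period_seconds=60, sns_topic_arn=None):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    load_balancer is the value of the LoadBalancer dimension, for example
    "app/my-load-balancer/50dc6c495c0c9188" (placeholder value).
    """
    params = {
        "AlarmName": f"{metric_name}-high",  # hypothetical naming scheme
        "Namespace": "AWS/ApplicationELB",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer}],
        "Statistic": "Sum",                  # count of responses per period
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),       # illustrative threshold
        "ComparisonOperator": "GreaterThanThreshold",
        # The metric is only reported when non-zero, so treat gaps as healthy.
        "TreatMissingData": "notBreaching",
    }
    if sns_topic_arn:
        # Optional notification target, e.g. an SNS topic with an email subscription.
        params["AlarmActions"] = [sns_topic_arn]
    return params
```

Assuming boto3 is installed and credentials are configured, the result could be passed along as `boto3.client("cloudwatch").put_metric_alarm(**params)`. Passing `metric_name="HTTPCode_ELB_5XX_Count"` would watch the 5XX counter mentioned in the quoted documentation instead.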

502 errors are due to healthcheck setup or resource exhaustion

My setup is a Bitnami WordPress instance hosted on a GCP N2-standard-2 VM. I'm using an HTTPS load balancer and CDN.
I have encountered the 502 errors a few times since I configured the load balancer. I was doing quite a bit of SEO and page scanning tests when this happened.
I've checked that the VM is only using 8-12% of the disk capacity. The log shows a maximum CPU usage of 9.62%. I have to restart the VM to resolve the error.
What is the cause of the 502 errors?
Could it be due to the traffic spike from third party scanning sites?
Is it because of my health check configuration?
Do I have to change a machine type and increase the memory?
What should I look into to troubleshoot it?
This is my healthcheck setup
The server was down again and this time round I managed to look for the information you have suggested.
The error is not from Load Balancer
The error is from VM and the error message is:
"Error watching metadata: Get http://169.254.169.254/computeMetadata/v1//?recursive=true&alt=json&wait_for_change=true&timeout_sec=60&last_etag=ag92d16ff423b06: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"
VM disk size is 100GB. Machine Type is N2-standard-2 VM
It is a WordPress instance
Everything is within Quota
Incidents happen on a few occasions:
When I use a third-party site to scan the website for dead links. After the scan is completed, the server goes down shortly afterwards, and I have to reboot the instance to make it functional again.
It also happens randomly and recovers by itself after a while.
Thanks everyone for your help. I just managed to figure out how to retrieve the other required info.
I was wrong that the load balancer didn't report any errors.
Below is from Logging
From Loadbalancer : Client disconnected before any response
From Loadbalancer: 502 - failed_to_pick_backend
From Unmanaged Instance Group: Timeout waiting for data and HTTP Response Internal server error
I tried increasing the load balancer timeout, but the VM still shuts down and reboots on its own. Sometimes it takes a few minutes to recover, and sometimes it takes more than an hour.
I provided some screenshots which recorded the recent incident from 8.47 to 8.54.
Below is from Monitoring

Google Load Balancer retrying the request on web server after timeout expires

We are using Google Load Balancer with a Tomcat server. We have configured a specific timeout on the load balancer in the Cloud Console settings. Whenever a request takes longer than this timeout, GLB returns 502, which is expected.
Here the problem is:
Whenever a request takes longer than the configured time, we receive the same request again on the Tomcat side exactly when the timeout expires. For example, with a timeout of 30 seconds, the same request arrives at Tomcat again exactly 30 seconds later.
In the browser, the response time for the 502 is exactly twice the timeout. (It might be due to network turnaround time, but why is it always exactly twice?)
I assume you are referring to HTTP(S) load balancing. In this scenario, a reverse proxy is sitting in front of your application, handling requests and forwarding them to your backends. This proxy (a GFE) will retry as documented:
HTTP(S) load balancing retries failed GET requests in certain circumstances, such as when the response timeout is exhausted. It does not retry failed POST requests. Retries are limited to two attempts. Retried requests only generate one log entry for the final response. Refer to Logging for more information.
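The arithmetic behind the "exactly twice" observation can be sketched as follows. This is a simplified model, not the load balancer's actual implementation: it assumes two attempts in total (the original request plus one retry), each running for the full response timeout, and it ignores network latency. The function names are made up for this illustration.

```python
def attempt_start_times(timeout_s, attempts=2):
    """Seconds (relative to the first request) at which each attempt reaches the backend."""
    return [i * timeout_s for i in range(attempts)]


def time_until_502(timeout_s, attempts=2):
    """Seconds the client waits before seeing the 502 if every attempt times out."""
    return attempts * timeout_s
```

With a 30-second timeout, this model has the backend seeing the request at 0 and 30 seconds and the browser receiving the 502 after 60 seconds, which matches the behavior described in the question.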

504 errors from Elastic Load Balancer using Tomcat

I have an application running on multiple EC2 instances and served by Apache Tomcat. I've set up an AWS Elastic Load Balancer in front of the application, and everything basically works as expected. However, I will occasionally get a random 504 timeout error from the ELB. This doesn't seem to be related to load, as I've seen the errors under light load and heavy load. Also, it doesn't seem to occur in any regular pattern or situation.
Earlier in my testing, I was getting 504 errors because my application was taking longer to respond than the default 60 second timeout on ELB. I resolved that by bumping up the ELB timeout to the level necessary for my app. However, the 504 errors I'm getting now are happening very quickly. So, for example, one error I saw was on a request with a response time of about a second. It seems odd to be getting a timeout error when the request can't possibly have timed out on the application server.
This may be a similar issue to this question, though I couldn't quite tell from the information presented. Also, I don't have an additional load balancer in the mix, just ELB straight to Tomcat.
So, after some more digging, I've found the issue. This page was helpful in solving the mystery by explaining some details about idle and keepalive timeouts:
There are two immediate causes for receiving a 504 from an ELB:
1. The application actually took longer than the ELB's connection timeout to respond. This is a slow timeout: the 504 will typically be returned after a number of seconds, with the default for an ELB being 60 seconds. In this case, it is necessary either to increase the ELB's connection timeout, or improve application performance.
2. The application did not respond to the ELB at all, instead closing its connection when data was requested. This is a fast timeout: the 504 will typically be returned in a matter of milliseconds, well under the ELB's timeout setting.
The first scenario was what I had seen and resolved by raising the ELB timeout. The second scenario describes the confusing behavior I was seeing after raising the ELB timeout. My log files had the "-1 -1 -1" pattern like the example logs from the article:
2015-12-11T13:42:07.736195Z my-elb 10.0.0.1:59893 - -1 -1 -1 504 0 0 0 "GET http://my-elb/ HTTP/1.1" "curl/7.19.7" - -
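To find such entries in bulk, a small script can flag log lines that show this pattern. The sketch below assumes the space-separated layout of the sample line above (timestamp, ELB name, client, backend, three timing fields, then the ELB status code); the function name `is_fast_504` is made up for this example.

```python
import shlex


def is_fast_504(log_line):
    """Return True if an ELB access log line is a 504 with all three timings set to -1."""
    try:
        # shlex keeps the quoted request and user-agent strings as single fields.
        fields = shlex.split(log_line)
    except ValueError:
        # Unbalanced quotes (for example in an unusual user agent): skip the line.
        return False
    if len(fields) < 8:
        return False
    timings = fields[4:7]   # request, backend, and response processing times
    elb_status = fields[7]  # status code returned by the ELB
    return elb_status == "504" and timings == ["-1", "-1", "-1"]
```

Running each line of an access log file through `is_fast_504` and counting the matches gives a quick indication of whether the "fast" scenario is occurring.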
From the conclusion:
In short, an ELB's connection timeout must be set lower than both the application's idle and keepalive timeouts to prevent spurious 504s from being generated.
At some point during development before I started using ELB, I set the Tomcat timeout such that it happened to be higher than the default ELB timeout. When I bumped up the ELB timeout, I made it higher than the connectionTimeout I had set in Tomcat. Raising connectionTimeout to be slightly higher than my new ELB timeout got rid of the mystery 504 errors. So, I've now gotten rid of both the "slow" and "fast" timeout errors.
Tomcat also has a keepAliveTimeout setting which defaults to be the same as connectionTimeout if not set. I didn't have it set, so modifying connectionTimeout was enough to resolve my issue.
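The rule can be expressed as a small sanity check. This is only a sketch: the function name is made up, and it assumes the ELB timeout is given in seconds while the Tomcat connector values are in milliseconds, with keepAliveTimeout falling back to connectionTimeout when unset, as described above.

```python
def find_timeout_problems(elb_idle_timeout_s, connection_timeout_ms,
                          keep_alive_timeout_ms=None):
    """Return a list of messages for backend timeouts not above the ELB timeout."""
    if keep_alive_timeout_ms is None:
        # Unset keepAliveTimeout falls back to connectionTimeout.
        keep_alive_timeout_ms = connection_timeout_ms
    elb_ms = elb_idle_timeout_s * 1000
    problems = []
    if connection_timeout_ms <= elb_ms:
        problems.append(
            f"connectionTimeout ({connection_timeout_ms} ms) should exceed "
            f"the ELB timeout ({elb_ms} ms)")
    if keep_alive_timeout_ms <= elb_ms:
        problems.append(
            f"keepAliveTimeout ({keep_alive_timeout_ms} ms) should exceed "
            f"the ELB timeout ({elb_ms} ms)")
    return problems
```

For example, an ELB timeout of 120 seconds with a Tomcat connectionTimeout of 60000 ms yields two messages, while raising connectionTimeout to 130000 ms yields an empty list.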
The ELB is not likely to be the cause of a problem, but instead showing that you have one. The 504 error is Gateway Timeout which occurs when the server (in this case Tomcat) does not respond quickly enough.
(I have been using ELBs for extremely high-load services for many years, and I do not agree with the answer in the linked SO question. While it is technically true, and may apply at extremely high burst rates such as thousands of requests per second, unless your volume is that high I would look at your application first.)
The most obvious test to confirm it's not the ELB is to test requests directly against one of the Tomcat servers in your cluster. If you cannot route to the Tomcat instances, you could try curl to localhost from the instance you want to test.
Note also that there is a Health Check setting for ELBs and these allow you to set certain rules defining whether the server is healthy -- if not, the ELB will remove it from the cluster until it is healthy again. Health can include timely response. Look at CloudWatch for the ELB to see if there have been unhealthy instances recently.
If you were seeing 504s in development and they are now more frequent, I would guess that this is actually a load or performance issue. The most typical cause is Java getting into garbage collection thrashing due to a problem in the underlying application. Look at the CloudWatch metrics for your EC2 instances to see whether memory or CPU usage is high or spiky.
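The direct test against a Tomcat instance suggested above can also be scripted instead of using curl. The following is a minimal sketch using only the Python standard library; the function name and the default timeout are assumptions, and the URL would be whatever address reaches your Tomcat instance directly, bypassing the ELB.

```python
import time
import urllib.error
import urllib.request


def time_request(url, timeout_s=10.0):
    """Send one GET request and return (HTTP status, elapsed seconds).

    Connection failures and client-side timeouts raise an exception,
    which is itself a useful signal when testing a backend directly.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            response.read()  # include the body transfer in the timing
            status = response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code worth recording.
        status = err.code
    return status, time.monotonic() - start
```

Calling this in a loop against the instance and against the ELB, then comparing the status codes and timings, helps show whether the slow or failed responses originate from the application itself.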