AWS/EKS: Getting frequent 504 gateway timeout errors from ALB - amazon-web-services

I'm using EKS to deploy a service, with ingress running on top of alb-ingress-controller.
All in all I have about 10 replicas of a single pod, with a single service of type NodePort which forwards traffic to them. The replicas run on 10 nodes, established with eksctl, and spread across 3 availability zones.
The problem I'm seeing is very strange - inside the cluster, all the logs show that requests are being handled in less than 1s, mostly around 20-50 millis. I know this because I used linkerd to show the percentiles of request latencies, as well as the app logs themselves. However, the ALB logs/monitoring tell a very different story. I see a relatively high request latency (often approaching 20s or more), and often also 504 errors returned from the ELB (sometimes 2-3 every 5 minutes).
When trying to read the access logs for the ALB, I noticed that the 504 lines look like this:
https 2019-12-10T14:56:54.514487Z app/1d061b91-XXXXX-3e15/19297d73543adb87 52.207.101.56:41266 192.168.32.189:30246 0.000 -1 -1 504 - 747 308 "GET XXXXXXXX" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-1:750977848747:targetgroup/1d061b91-358e2837024de757a3d/e59bbbdb58407de3 "Root=1-5defb1fa-cbcdd248dd043b5bf1221ad8" "XXXX" "XXXX" 1 2019-12-10T14:55:54.514000Z "forward" "-" "-" "192.168.32.189:30246" "-"
Where the request processing time is 0 and the target processing time is -1, indicating the request never made it to the backend, and response was returned immediately.
I tried to play with the backend HTTP keepalive timeout (currently at 75s) and with the ALB idle time (currently at 60s) but nothing seems to change much for this behavior.
If anyone can point me to how to proceed and investigate this, or what the cause can be, I'd appreciate it very much.

We faced a similar type of issue with EKS and ALB combination. If the target response code says -1, there may be a chance that the request waiting queue is full on the target side. So the ALB will immediately drop the request.
Try to do an ab benchmark by skipping the ALB and directly send the request to the service or the private IP address. Doing this will help you to identify where the problem is.
For us, 1 out of 10 requests failed if we send traffic via ALB. We are not seeing failures if we directly send the request to the service.
AWS recommendation is to use NLB over the ALB. NLB gives more advantages and suitable for Kubernetes. There is a blog which explains this Using a Network Load Balancer with the NGINX Ingress Controller on Amazon EKS
We changed to NLB and now we are not getting 5XX errors.

Related

Getting 5xx error with AWS Application Load Balancer - fluctuating healthy and unhealthy target group

My web application on AWS EC2 + load balancer sometimes shows 500 errors. How do I know if the error is on the server side or the application side?
I am using Route 53 domain and ssl on my url. I set the ALB redirect requests on port 80 to 443, and forward requests on port 443 to the target group (the EC2). However, the target group is returning 5xx error code sometimes when handling the request. Please see the screenshots for the metrics and configurations for the ALB.
Target Group Metrics
Target Group Configuration
Load Balancer Metrics
Load Balancer Listeners
EC2 Metrics
Right now the web application is running unsteady, sometimes it returns a 502 or 503 service unavailable (seems like it's a connnection timeout).
I have set up the ALB idle timeout 4000 secs.
ALB configuration
The application is using Nuxt.js + PHP7.0 + MySQL + Apache 2.4.54.
I have set the Apache prefork worker Maxclient number as 1000, which should be enough to handle the requests on the application.
The EC2 is a t2.Large resource, the CPU and Memory look enough to handle the processing.
It seems like if I directly request the IP address but not the domain, the amount of 5xx errors significantly reduced (but still exists).
I also have Wordpress application host on this EC2 in a subdomain (CNAME). I have never encountered any 5xx errors on this subdomain site, which makes me guess there might be some errors in my application code but not on the server side.
Is the 5xx error from my application or from the server?
I also tried to add another EC2 in the target group see if they can have at lease one healthy instance to handle the requests. However, the application is using a third-party API and has strict IP whitelist policy. I did some research that the Elastic IP I got from AWS cannot be attached to 2 different EC2s.
First of all, if your application is prone to stutters, increase healthcheck retries and timeouts, which will affect your initial question of flapping health.
To what I see from your screenshot, most of your 5xx are due to either server or application (you know obviously better what's the culprit since you have access to their logs).
To answer your question about 5xx errors coming from LB: this happens directly after LB kicks out unhealthy instance and if there's none to replace (which shouldn't be the case because you're supposed to have ASG if you enable evaluation of target health for LB), it can't produce meaningful output and thus crumbles with 5xx.
This should be enough information for you to make adjustments and logs investigation.

Putting ALb-NLB-ALB route for requests is giving 502 for application

We had a primary ALB listening to all out apps mapped through R53 records. Now we have listener rule crunch as ALB doesn't support more rules above 100. So we had been proposed a solution where we can put a NLB under primary ALB and then secondary ALB under NLB.
So flow will be:
Requests--->R53--->ALB1--->NLB--->ALB2--->Apps
ALB1 has a default rule which allows unmatched requests to pass through to NLB and then ultimately to ALB2 where new rules are evaluated.
Rule configuration at ALB1 is:
Default rule --Forwardto-->
Rule at NLB:
TCP-443 listener rule --ForwardTo--> ALB2 TG with fargate application ip
But we're seeing intermittent 502 responses on primary ALB while testing. We are not seeing any 502 logging on ALB2. So possibly NLB is ending it as we have seen multiple TArget reset count happening at NLB in metrics.
Also, nothing is getting logged in application logs.
We did another testing where we directly routed traffic to ALB2 through R53, we didn't see any 502 responses there.
Any suggestion, how to go about the debugging it?
I think, I have the answer to my problem now, so sharing it for wider audience. The reason for intermittent 502s was the inconsistency of idle_timeout_value across the Lbs and backend application.
Since for NLB idle_timeout_value is set to 350 seconds by default, and can't be changed, we had inconsistent values across LBs. First ALB and last ALB had value 600 seconds.
Ideally application should have highest idle_timeout_value followed by LBs in hierarchy.
So setting up value of first ALB to 300 seconds and second ALB to 500 seconds solved this problem. And we haven't got a single 500 code post this implementation.

502 - failed_to_connect_to_backend from LB on HEALTHY target group

I have some global Load balancer on the GCP. This balancer should send requests to the instance group with two back services.
And when I try to send some requests, I randomly get 502 errors failed_to_connect_to_backend from my load balancer.
I can get a successful answer seven times, one by one, and then 2-3 times 502 error for the same request.
In the
Monitoring Dashboard I see this - my both services are healthy.
The Instanse groups overwiev shows 100% healthy status too.
URL map rules is default default
I also don`t see any problems with resource consumption
And, unfortunately, I couldn't get any logs from the back-end side for the 502 errors, have only logs from the Load Balancer
After hours of coffee and liters of manuals (I'm not very well versed in GCP yet) the "problem" is solved - at some point I noticed that the execution time of all failed requests is ~ 9 seconds.
Therefore, I tried to search for results with similar symptoms, as a result I found a answer on the Google Groups
In my case - we have trobles with port mapping(was used two ports in mapping - like 80, 6000. And 80 - was not listening from the backend side)
After removing closed - 502s gone away.
If port 80 was not allowed on the firewall rule applied on the backend instances?

504 errors from Elastic Load Balancer using Tomcat

I have a application running on multiple EC2 instances and served by Apache Tomcat. I've set up an AWS Elastic Load Balancer in front of the application, and everything basically works as expected. However, I will occasionally get a random 504 timeout error from the ELB. This doesn't seem to be related to load, as I've seen the errors under light load and heavy load. Also, it doesn't seem to occur in any regular pattern or situation.
Earlier in my testing, I was getting 504 errors because my application was taking longer to respond than the default 60 second timeout on ELB. I resolved that by bumping up the ELB timeout to the level necessary for my app. However, the 504 errors I'm getting now are happening very quickly. So, for example, one error I saw was on a request with a response time of about a second. It seems odd to be getting a timeout error when the request can't possibly have timed out on the application server.
This may be a similar issue to this question, though I couldn't quite tell from the information presented. Also, I don't have an additional load balancer in the mix, just ELB straight to Tomcat.
So, after some more digging, I've found the issue. This page was helpful in solving the mystery by explaining some details about idle and keepalive timeouts:
There are two immediate causes for receiving a 504 from an ELB:
The application actually took longer than the ELB's connection timeout to respond. This is a slow timeout — the 504 will typically be
returned after a number of seconds, with the default for an ELB being
60 seconds. In this case, it is necessary either to increase the ELB's
connection timeout, or improve application performance.
The application did not respond to the ELB at all, instead closing its connection when data was requested. This is a fast timeout — the
504 will typically be returned in a matter of milliseconds, well under
the ELB's timeout setting.
The first scenario was what I had seen and resolved by raising the ELB timeout. The second scenario describes the confusing behavior I was seeing after raising the ELB timeout. My log files had the "-1 -1 -1" pattern like the example logs from the article:
2015-12-11T13:42:07.736195Z my-elb 10.0.0.1:59893 - -1 -1 -1 504 0 0 0 "GET http://my-elb/ HTTP/1.1" "curl/7.19.7" - -
From the conclusion:
In short, an ELB's connection timeout must be set lower than both the
application's idle and keepalive timeouts to prevent spurious 504s
from being generated.
At some point during development before I started using ELB, I set the Tomcat timeout such that it happened to be higher than the default ELB timeout. When I bumped up the ELB timeout, I made it higher than the connectionTimeout I had set in Tomcat. Raising connectionTimeout to be slightly higher than my new ELB timeout got rid of the mystery 504 errors. So, I've now gotten rid of both the "slow" and "fast" timeout errors.
Tomcat also has a keepAliveTimeout setting which defaults to be the same as connectionTimeout if not set. I didn't have it set, so modifying connectionTimeout was enough to resolve my issue.
The ELB is not likely to be the cause of a problem, but instead showing that you have one. The 504 error is Gateway Timeout which occurs when the server (in this case Tomcat) does not respond quickly enough.
(I have been using ELBs for extremely high load services for many years, and do not agree with the answer to the link to other SO answer. While it is technically true, and may be true with extremely high bursting rates like thousands of requests in a second, unless your volume is this high, I would look at your application, first.)
The most obvious test to confirm it's not the ELB is to test requests directly against one of the Tomcat servers in your cluster. If you cannot route to the Tomcat instances, you could try curl to localhost from the instance you want to test.
Note also that there is a Health Check setting for ELBs and these allow you to set certain rules defining whether the server is healthy -- if not, the ELB will remove it from the cluster until it is healthy again. Health can include timely response. Look at CloudWatch for the ELB to see if there have been unhealthy instances recently.
If you were seeing 504 in development, and now it's more frequent, I would guess that this is actually a load or performance issue. The most typical is that Java gets into some garbage collection thrashing issue due to a problem with the underlying application. Look at CloudWatch metrics for your EC2 instances to see if memory or CPU is high or spikey.

HTTP 504 errors returned by ELB even when hosts are healthy and able to serve request

I have a service which is deployed on Amazon Web Services (AWS), specifically 2 instances behind an Elastic Load Balancer (ELB). Availability zones are selected as all three us-west-2a,b,c
but only 2 of the above 3 zones have instances running in it.
The issues is that even though the traffic/load is not too high but I still get HTTP 504 errors from ELB often enough.
The log lines reads like this
-1 -1 -1 504 0 0 0
In order, --request_processing_time --backend_processing_time --response_processing_time --elb_status_code --backend_status_code --received_bytes --sent_bytes.
Description of what each field and response means can be found here
ELB idle timeout is 60 seconds. KeepAlive is 'On' on backend instances. Latency of requests from ELB are in check. I have tried increasing KeepAliveTimeout but to no avail.
Does anyone have any idea about how to proceed? I don't even know the root cause of this issue.
PS: More like a second question, there are a few cases (much less than 504 being returned by ELB when backend does not even accept the request) where even backend is returning a 504 and then ELB is forwarding the same to client. To the best of my knowledge HTTP 504 should be returned by a proxy only when backend is timing out. How can a server itself return a 504?
So that it might assist others in future, I am publishing my finding(s) here:
1) This 504 0 HTTP errors were mainly because of logrotate reloading apache instead of graceful restart.
The current AWS config does the following
/sbin/service httpd reload > /dev/null 2>/dev/null || true
so replace the service command with either apachectl -k graceful or /sbin/service httpd graceful
File location on my ec2 instance: /etc/logrotate.elasticbeanstalk.hourly/logrotate.elasticbeanstalk.httpd.conf
2) Because logrotate frequency was too high by default in AWS (once every hour), at least for my use case, and that in turn was reloading apache every hour, so I reduced that as well.
When backend connection timeout, ELB will put -1 to backend_processing_time column in its access log. Think what's happening is that some of your requests take longer than 60s for your backend to process. To confirm this, can you check your latency metrics? Please switch to maximum when viewing this metric. It will confirm my guess if you see latency frequently reaches 60s.
After it got confirmed, you might want to increase Idle timeout of your ELB and backend.