Why is the ALB Failing Health Checks on a Healthy Target?

Why is the ALB Failing Health Checks on a Healthy Target? - amazon-web-services

I spent quite a lot of time trying to debug this, so I thought I'd post it in case anyone else had the same issue. I was trying to debug an ALB health check issue with Fargate. I could manually connect and see that everything was coming through. I could even connect to the Fargate instance and see that the health checks were coming through and being responded to appropriately. But the ALB kept reporting the health checks as failing.
In this specific case, I am using Tomcat as the server and Fargate as the destination, and the specific error message is "request timed out", but I think other setups (and even other error messages) conform to this case.

It turned out that the only problem was, on my service, I needed to GREATLY increase HealthCheckGracePeriodSeconds. This is the amount of time that the Load Balancer will wait before it starts counting health checks against you.
It turns out that there is quite a bit of latency between what the Load Balancer is doing and what it is reporting to you. By the time I was getting the "request timed out" error, the load balancer had already decided that my machine was failing health checks, but hadn't removed it from the pool yet. So, for me, it looked like it was running correctly, and the Load Balancer was still sending health checks even though it had made the decision to remove the machine from the pool. It's just the latency between the decisions being made on the Load Balancer, when it implemented those decisions, as well as when it reports them, caused quite a bit of confusion on my end.
So, if you are having problems with targets being added to your load balancer (in my case, it was a Tomcat server), especially on Fargate, be sure to check HealthCheckGracePeriodSeconds to be sure that you are giving it enough time to start all the way up. You can set it to a ridiculously high value to make sure (I think it can go up to 67 years).

Related

Shutdown scripts for Google managed instance groups

I am running managed instance group in google cloud. These are behind a loadbalancer and it is working fine. The problem is when the managed instance group scales down, the loadbalancer will not notice this until after the instance has been killed so some requests will be sent to an instance that is dead causing the application to not work properly for a while.
On this page https://cloud.google.com/compute/docs/autoscaler/understanding-autoscaler-decisions I have read that shutdown scripts can be used. I tried to add one that tells the instance it will be shut down so it starts sending unhealthy when the load balancer does a health check, the script then waits for a while to make sure to give the load balancer time to check it. It does however not seem to work. The script seems to be called but to late so it just shuts down.
Anyone know how to write a shutdown script for this scenario?

Seems like this was not the problem after all. After inspecting the logs it turned out that the health checks timed out causing the load balancer to ot find any nodes with a healthy state.

Google cloud load balancer causing error 502 - failed_to_pick_backend

I've got an error 502 when I use google cloud balancer with CDN, the thing is, I am pretty sure I must have done something wrong setting up the load balancer because when I remove the load balancer, my website runs just fine.
This is how I configure my load balancer
here
Should I use HTTP or HTTPS healthcheck, because when I set up HTTPS
healthcheck, my website was up for a bit and then it down again
I have checked this link, they seem to have the same problem but it is not working for me.
I have followed a tutorial from openlitespeed forum to set Keep-Alive Timeout (secs) = 60s in server admin panel and configure instance to accepts long-lived connections ,still not working for me.
I have added these 2 firewall rules following this google cloud link to allow google health check ip but still didn’t work:
https://cloud.google.com/load-balancing/docs/health-checks#fw-netlb
https://cloud.google.com/load-balancing/docs/https/ext-http-lb-simple#firewall
When checking load balancer log message, it shows an error saying failed_to_pick_backend . I have tried to re-configure load balancer but it didn't help.
I just started to learn Google Cloud and my knowledge is really limited, it would be greatly appreciated if someone could show me step by step how to solve this issue. Thank you!

Posting an answer - based on OP's finding to improve user experience.
Solution to the error 502 - failed_to_pick_backend was changing Load Balancer from HTTP to TCP protocol and at the same type changing health check from HTTP to TCP also.
After that LB passes through all incoming connections as it should and the error dissapeared.
Here's some more info about various types of health checks and how to chose correct one.

The error message that you're facing it's "failed_to_pick_backend".
This error message means that HTTP responses code are generated when a GFE was not able to establish a connection to a backend instance or was not able to identify a viable backend instance to connect to
I noticed in the image that your health-check failed causing the aforementioned error messages, this Health Check failing behavior could be due to:
Web server software not running on backend instance
Web server software misconfigured on backend instance
Server resources exhausted and not accepting connections:
- CPU usage too high to respond
- Memory usage too high, process killed or can't malloc()
- Maximum amount of workers spawned and all are busy (think mpm_prefork in Apache)
- Maximum established TCP connections
Check if the running services were responding with a 200 (OK) to the Health Check probes and Verify your Backend Service timeout. The Backend Service timeout works together with the configured Health Check values to define the amount of time an instance has to respond before being considered unhealthy.
Additionally, You can see this troubleshooting guide to face some error messages (Including this).

Those experienced with Kubernetes from other platforms may be confused as to why their Ingresses are calling their backends "UNHEALTHY".
Health checks are not the same thing as Readiness Probes and Liveness Probes.
Health checks are an independent utility used by GCP's Load Balancers and perform the exact same function, but are defined elsewhere. Failures here will lead to 502 errors.
https://console.cloud.google.com/compute/healthChecks

Fixing AWS Application Load Balancer connection timeouts under small loads

I'm currently using an application load balancer to split traffic between 3 instance.
Testing the individual instance (connection via IP), they are able to handle about ~200req/second without any connection timeout (timeout being set a 5 seconds).
As such, I'd assume that load balancing over all 3 of them would scale this up to ~600req/second (there's no bottleneck further down the pipe to stop this).
However, when sending the exact same type of test requests to an application load balancer, the connections start to time out before I even hit 100req/second.
I already eliminated the possibly of a low down due to HTTPS (by just sending the requests over HTTP), the instance themselves are healthy and not under heavy use an the load balancer reports no "errors".
I've also configured IP stickiness for ~20 minutes to try and improve the situation but it hasn't helped one bit.
What could be the cause of this problem ? I found no information about increasing the network capacity of LB on aws and no similar questions, so I'm bound to be doing something wrong, but I'm quite unsure what that something is.

504 errors from Elastic Load Balancer using Tomcat

I have a application running on multiple EC2 instances and served by Apache Tomcat. I've set up an AWS Elastic Load Balancer in front of the application, and everything basically works as expected. However, I will occasionally get a random 504 timeout error from the ELB. This doesn't seem to be related to load, as I've seen the errors under light load and heavy load. Also, it doesn't seem to occur in any regular pattern or situation.
Earlier in my testing, I was getting 504 errors because my application was taking longer to respond than the default 60 second timeout on ELB. I resolved that by bumping up the ELB timeout to the level necessary for my app. However, the 504 errors I'm getting now are happening very quickly. So, for example, one error I saw was on a request with a response time of about a second. It seems odd to be getting a timeout error when the request can't possibly have timed out on the application server.
This may be a similar issue to this question, though I couldn't quite tell from the information presented. Also, I don't have an additional load balancer in the mix, just ELB straight to Tomcat.

So, after some more digging, I've found the issue. This page was helpful in solving the mystery by explaining some details about idle and keepalive timeouts:
There are two immediate causes for receiving a 504 from an ELB:
The application actually took longer than the ELB's connection timeout to respond. This is a slow timeout — the 504 will typically be
returned after a number of seconds, with the default for an ELB being
60 seconds. In this case, it is necessary either to increase the ELB's
connection timeout, or improve application performance.
The application did not respond to the ELB at all, instead closing its connection when data was requested. This is a fast timeout — the
504 will typically be returned in a matter of milliseconds, well under
the ELB's timeout setting.
The first scenario was what I had seen and resolved by raising the ELB timeout. The second scenario describes the confusing behavior I was seeing after raising the ELB timeout. My log files had the "-1 -1 -1" pattern like the example logs from the article:
2015-12-11T13:42:07.736195Z my-elb 10.0.0.1:59893 - -1 -1 -1 504 0 0 0 "GET http://my-elb/ HTTP/1.1" "curl/7.19.7" - -
From the conclusion:
In short, an ELB's connection timeout must be set lower than both the
application's idle and keepalive timeouts to prevent spurious 504s
from being generated.
At some point during development before I started using ELB, I set the Tomcat timeout such that it happened to be higher than the default ELB timeout. When I bumped up the ELB timeout, I made it higher than the connectionTimeout I had set in Tomcat. Raising connectionTimeout to be slightly higher than my new ELB timeout got rid of the mystery 504 errors. So, I've now gotten rid of both the "slow" and "fast" timeout errors.
Tomcat also has a keepAliveTimeout setting which defaults to be the same as connectionTimeout if not set. I didn't have it set, so modifying connectionTimeout was enough to resolve my issue.

The ELB is not likely to be the cause of a problem, but instead showing that you have one. The 504 error is Gateway Timeout which occurs when the server (in this case Tomcat) does not respond quickly enough.
(I have been using ELBs for extremely high load services for many years, and do not agree with the answer to the link to other SO answer. While it is technically true, and may be true with extremely high bursting rates like thousands of requests in a second, unless your volume is this high, I would look at your application, first.)
The most obvious test to confirm it's not the ELB is to test requests directly against one of the Tomcat servers in your cluster. If you cannot route to the Tomcat instances, you could try curl to localhost from the instance you want to test.
Note also that there is a Health Check setting for ELBs and these allow you to set certain rules defining whether the server is healthy -- if not, the ELB will remove it from the cluster until it is healthy again. Health can include timely response. Look at CloudWatch for the ELB to see if there have been unhealthy instances recently.
If you were seeing 504 in development, and now it's more frequent, I would guess that this is actually a load or performance issue. The most typical is that Java gets into some garbage collection thrashing issue due to a problem with the underlying application. Look at CloudWatch metrics for your EC2 instances to see if memory or CPU is high or spikey.

How can I configure an automatic timeout for an Elastic Load Balancer?

Does anyone know of a way to make Amazon's Elastic Load Balancers timeout if an HTTP response has not been received from upstream in a set timeframe?
Occasionally Amazon's Elastic Beanstalk will fail an update and any requests to the specified resource (running Nginx + Node if tht's any use) will hang any request pages whilst the resource attempts to load.
I'd like to keep the request timeout under 2s, and if the upstream server has no response by then, to automatically fail over to a default 503 response.
Is this possible with ELB?
Cheers

You can Configure Health Check Settings for Elastic Load Balancing to achieve this:
Elastic Load Balancing routinely checks the health of each registered Amazon EC2 instance based on the configurations that you specify. If Elastic Load Balancing finds an unhealthy instance, it stops sending traffic to the instance and reroutes traffic to healthy instances. For more information on configuring health check, see Health Check.
For example, you simply need to specify an appropriate Ping Path for the HTTP health check, a Response Timeout of 2 seconds and an UnhealthyThreshold of 1 to approximate your specification.
See my answer to What does the Amazon ELB automatic health check do and what does it expect? for more details on how the ELB health check system work.

TLDR - Set your timeout in Nginx.
Let's see if we can walkthrough the issues.
Problem:
The client should be presented with something quickly. It's okay if it's a 500 page. However, the ELB currently waits 60 seconds until giving up (https://forums.aws.amazon.com/thread.jspa?messageID=382182) which means it takes a minute before the user is shown anything.
Solutions:
Change the timeout of the ELB
Looks like AWS support will help increase the timeout (https://forums.aws.amazon.com/thread.jspa?messageID=382182) so I imagine that you'll be able to ask for the reverse. Thus, we can see that it's not user/api tunable and requires you to interact with support. This takes a bit of lead time and more importantly, seems like an odd dial to tune when future developers working on this project will be surprised by such a short timeout.
Change the timeout of the nginx server
This seems like the right level of change. You can use proxy_read_timeout (http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_read_timeout) to do what you're looking for. Tune it to something small (and in particular, you can set it for a particular location if you would like).
Change the way the request happens.
It may be beneficial to change how your client code works. You could imagine shipping a really simple html/js page that 1. pings to see if the job is done and 2. keeps the user updated on the progress. This takes a bit more work then just throwing the 500 page.

Recently, AWS added a way to configure timeouts for ELB. See this blog post:
http://aws.amazon.com/blogs/aws/elb-idle-timeout-control/

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js