We are using CodeDeploy to load code onto our instances as they boot up. Our intention was that they would not be added to the LB prior to the code being loaded. To do this, we set a health check which looked for one of the files being deployed. What we have found is that some times instances without code are created (I assume code deploy failed) and these instances are staying in the LB even when marked unhealthy? How is this possible? Is this related to the grace period? Shouldn't instances that are unhealthy be removed automatically?
I believe I have found a large part of my problem: My Auto-scale group was set to use EC2 health checks and not my ELB health check. This resulted in the instance not being terminated. The traffic may have continued to flow longer to this crippled instance due to the need the need for a very long unhealthy state before having traffic completely stopped.
Related
We have a private EC2 Linux instance running behind an ALB. There is only one instance running and no auto-scaling configured.
Sometimes ALB marks the instance as unhealthy for some reasons. This mostly happens when network traffic is high on the instance, which generally one or two hours. This behavior is unpredictable. So when try to access the web application which is deployed in the EC2 instance, we get 502 bad gateway. We reboot the EC2 instance and only then the issue is resolved.
Does an ALB perform a health check on a target group again after it marks it as unhealthy? Suppose an ALB marks the target group with one EC2 instance as unhealthy. ALB is configured to perform a health check every 30 seconds. Will it check for healthiness after 30 seconds after it marked as unhealthy on the same target group? Or will it look for new healthy instance?
I assume auto-scaling configuration may resolve this problem by setting AS group with 1 when an instance go unhealthy? Our AWS architect feels the Tomcat is creating memory leak when too many requests come at a time.Tomcat does not run in the EC2.
What is the way to troubleshoot this problem? I search for system logs and configured ALB access logs, but no clue is available.
In this link I see ALB routes requests to the unhealthy targets when no other healths target is available .
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html
My question is will ALB perform health check on the target group again after it marks it as unhealthy?
Indeed even when marked as unhealthy, the ALB continues the health checking. You can configure a 'healthy threshold count', which indicates how many 'healthy' responses should be received before an unhealthy host is marked as healthy again.
According to the docs:
When the health checks exceed HealthyThresholdCount consecutive successes, the load balancer puts the target back in service.
If your health check interval is 60 seconds, and the healthy threshold count is 3, it takes a minimum of 3 minutes before an unhealthy host will be marked healthy again.
I have a backend server deployed on aws in a single EC2 instance via elastic beanstalk. The server has ip whitelisting and hence does not respond to ALB health checks, so all target groups always remain unhealthy.
According to the official AWS docs on health checks,
If a target group contains only unhealthy registered targets, the load balancer nodes route requests across its unhealthy targets.
This is what keeps my application running even though the ALB target groups are always unhealthy.
This changed last night and I faced an outage where all requests started getting rejected with 503s for reasons I'm not able to figure out. I was able to get things to work again by provisioning another EC2 instance by increasing minimum capacity of elastic beanstalk.
During the window of the outage, cloudwatch shows there is neither healthy nor unhealthy instances, though nothing actually changed as there was one EC2 instance running for past few months untouched.
In that gap, I can find metrics on TCP connections though:
I don't really understand what happened here, can someone explain what or how to debug this?
Is there any event that could cause an EC2 instance to be removed from an ELB in AWS?
We have 4 services in a load balancer that usually chug away doing their thing, being updated only very occasionally. We've had issues from time to time where some of the instances would choke and need to be restarted, but they'd still be in the load balancer, simply listed as OutOfService.
However, today, we checked and found only 1 instance listed in the load balancer (as in completely removed, not OutOfService). The other 3 were healthy and the health check URL was returning a 200 status code. There's only two of us with access to the account so it definitely wasn't done manually.
Is there anything that could've caused the instances to be removed from the load balancer?
In this specific instance, the issue was caused by our deployment process taking machines out of the load balancer when deploying (i.e. doing a blue-green deployment). When a deployment issue occurred, the nodes wouldn't be added back into the load balancer.
I have 2 machines running under an Elastic Beanstalk environment.
One of them is down since the last deployment.
I was hoping that the auto scaling configuration will initiate a new machine due to having a single machine available.
That didn't happen and I'm trying to figure out what's wrong with my auto scaling configuration:
The first thing I see is that your rules contradict each other. It says if the number of unhealthy hosts are above 0, add a single host. If they are below 2, remove a single host. That may explain why you aren't seeing anything happening with your trigger.
Scaling triggers are used to bring in, or reduce, EC2 instances in your Auto Scaling group. This would be useful to bring in an additional instance(s) to maintain the same amount of computational power for your application while you investigate what caused the bad instance to fail. But this will not replace the instance.
To setup your instances to terminate after a certain period of being unhealthy you can follow the documentation here.
By default ELB pings port 80 with TCP, this is what determines the "health" of the EC2 instance, along with the on host EC2 instance status check. You can specify a Application health check URL to setup a customized health check that your application returns. Check out the more detailed customization of Beanstalk ELBs here.
Is there any way to have either ELB or an EC2 auto-scaling group terminate (or reboot) unhealthy instances from ELB?
There are some specific database failure conditions in our front end which makes it turn unhealthy, so the ELB will stop routing traffic to it. That instance is also part of an auto-scaling group, which scales on the group's CPU Load. So, what ends up happening is that the instance no longer gets traffic from ELB, so it has no CPU load, and skews the group's CPU load, thus screwing up the scaling conditions.
Is there an "easy" way to somehow configure ELB or an autoscaling group to automatically terminate unhealthy instances from the group without actually having to write code to do the polling and terminating via the EC2 API?
If you set the autoscaling group's health check type to ELB then it will automatically retire any instances that fail the ELB health checks (ie doesn't respond in a timely manner to the URL configured)
As long as the configured health check properly reports than an instance is bad (which sounds like it is the case since you say ELB is marking the instance as unhealthy) this should work