Is there a way to automatically terminate unhealthy EC2 instances from ELB? - amazon-web-services

Is there any way to have either ELB or an EC2 auto-scaling group terminate (or reboot) unhealthy instances from ELB?
There are some specific database failure conditions in our front end which make it turn unhealthy, so the ELB stops routing traffic to it. That instance is also part of an auto-scaling group, which scales on the group's CPU load. So what ends up happening is that the instance no longer gets traffic from the ELB, so it has no CPU load, which skews the group's CPU load and thus screws up the scaling conditions.
Is there an "easy" way to somehow configure ELB or an autoscaling group to automatically terminate unhealthy instances from the group without actually having to write code to do the polling and terminating via the EC2 API?

If you set the autoscaling group's health check type to ELB then it will automatically retire any instances that fail the ELB health checks (i.e. don't respond in a timely manner at the configured URL).
As long as the configured health check properly reports that an instance is bad (which sounds like the case, since you say the ELB is marking the instance as unhealthy), this should work.
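As a concrete sketch, the switch can be made with the AWS CLI; the group name below is a placeholder for your own:

```shell
# Tell the Auto Scaling group to trust the ELB health checks instead of
# the default EC2 status checks. The grace period gives new instances
# time to boot before checks count against them.
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name my-asg \
    --health-check-type ELB \
    --health-check-grace-period 300
```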

Related

Does an Application Load Balancer do automatic health checks on an unhealthy instance?

We have a private EC2 Linux instance running behind an ALB. There is only one instance running and no auto-scaling configured.
Sometimes the ALB marks the instance as unhealthy for some reason. This mostly happens when network traffic on the instance is high, which generally lasts one or two hours. This behavior is unpredictable. When we then try to access the web application deployed on the EC2 instance, we get a 502 Bad Gateway. We reboot the EC2 instance and only then is the issue resolved.
Does an ALB perform a health check on a target group again after it has marked it as unhealthy? Suppose an ALB marks a target group with one EC2 instance as unhealthy, and the ALB is configured to perform a health check every 30 seconds. Will it check that same target again 30 seconds after marking it unhealthy, or will it only look for new healthy instances?
I assume an auto-scaling configuration may resolve this problem by setting the Auto Scaling group's minimum size to 1 so the instance is replaced when it goes unhealthy? Our AWS architect feels Tomcat is creating a memory leak when too many requests come in at a time. Tomcat does not run in the EC2.
What is the way to troubleshoot this problem? I searched the system logs and configured ALB access logs, but found no clue.
In this link I see that the ALB routes requests to the unhealthy targets when no other healthy target is available:
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html
My question is will ALB perform health check on the target group again after it marks it as unhealthy?
Indeed, even when a target is marked as unhealthy, the ALB continues health checking it. You can configure a 'healthy threshold count', which indicates how many consecutive 'healthy' responses must be received before an unhealthy host is marked healthy again.
According to the docs:
When the health checks exceed HealthyThresholdCount consecutive successes, the load balancer puts the target back in service.
If your health check interval is 60 seconds, and the healthy threshold count is 3, it takes a minimum of 3 minutes before an unhealthy host will be marked healthy again.
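Both knobs can be adjusted on the target group with the AWS CLI; the target group ARN below is a placeholder. This sketch tightens the cadence so an instance returns to service in as little as one minute (30 s interval × 2 successes):

```shell
# Check every 30 seconds; require 2 consecutive successes before the
# target is put back in service. The ARN is a placeholder.
aws elbv2 modify-target-group \
    --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef \
    --health-check-interval-seconds 30 \
    --healthy-threshold-count 2
```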

AWS target groups turn unhealthy with no data

I have a backend server deployed on AWS in a single EC2 instance via Elastic Beanstalk. The server has IP whitelisting and hence does not respond to ALB health checks, so all target groups always remain unhealthy.
According to the official AWS docs on health checks,
If a target group contains only unhealthy registered targets, the load balancer nodes route requests across its unhealthy targets.
This is what keeps my application running even though the ALB target groups are always unhealthy.
This changed last night and I faced an outage where all requests started getting rejected with 503s for reasons I'm not able to figure out. I was able to get things to work again by provisioning another EC2 instance by increasing minimum capacity of elastic beanstalk.
During the window of the outage, CloudWatch shows neither healthy nor unhealthy instances, even though nothing actually changed: the same single EC2 instance had been running untouched for the past few months.
In that gap, I can still find metrics on TCP connections, though:
I don't really understand what happened here, can someone explain what or how to debug this?
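One place to start debugging is asking the ALB itself why it considers the target unhealthy; the `Reason` and `Description` fields distinguish failed checks from timeouts and deregistration. A sketch (the target group ARN is a placeholder):

```shell
# List each target's state plus the ALB's stated reason for it.
aws elbv2 describe-target-health \
    --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef \
    --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason,TargetHealth.Description]' \
    --output table
```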

AWS Beanstalk, how to reboot (or terminate) automatically an instance that is not responding

I have my Beanstalk environment with a "Scaling Trigger" using "CPUUtilization" and it works well.
The problem is that I cannot combine this with a system that automatically reboots (or terminates) instances that have been considered "OutOfService" for a certain amount of time.
Under "Scaling > Scaling Trigger > Trigger measurement" there is the option "UnHealthyHostCount". But this won't solve my problem optimally, because it will create new instances as long as any instance is unhealthy, which will cause my environment to grow to its limit without a real reason. Also, I cannot combine 2 "Trigger measurements" and I need the CPU one.
The problem becomes crucial when there is only one instance in the environment and it becomes OutOfService: the whole environment dies, and the Trigger measurement is never fired.
If you use a Classic Load Balancer in your Elastic Beanstalk environment, you can go to EC2 -> Auto Scaling Groups and change the Health Check Type of the group from EC2 to ELB.
By doing this, your Elastic Beanstalk instances will be terminated once they stop responding, and a new instance will be created to replace each terminated one.
AWS Elastic Beanstalk uses AWS Auto Scaling to manage the creation and termination of instances, including the replacement of unhealthy instances.
AWS Auto Scaling can integrate with the ELB (load balancer), also automatically created by Elastic Beanstalk, for health checks. ELB has a health check functionality. If the ELB detects that an instance is unhealthy, and if Auto Scaling has been configured to rely on ELB health checks (instead of the default EC2-based health checks), then Auto Scaling automatically replaces that instance that was deemed unhealthy by ELB.
So all you have to do is configure the ELB health check properly (you seem to have it correctly configured already, since you mentioned that you can see the instance being marked as OutOfService), and you also have to configure the Auto Scaling Group to use the ELB health check.
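A quick way to verify the second part is to query the Beanstalk-created group and check which health check source it reports; a sketch with a placeholder group name:

```shell
# Confirm the Auto Scaling group reports HealthCheckType "ELB" rather
# than the default "EC2". The group name is a placeholder; Beanstalk
# generates its own (visible in the EC2 -> Auto Scaling Groups console).
aws autoscaling describe-auto-scaling-groups \
    --auto-scaling-group-names my-eb-asg \
    --query 'AutoScalingGroups[].[AutoScalingGroupName,HealthCheckType,HealthCheckGracePeriod]' \
    --output table
```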
For more details on this subject, including the specific steps to configure all this, check these 2 links from the official documentation:
http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.healthstatus.html#using-features.healthstatus.understanding
http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/environmentconfig-autoscaling-healthchecktype.html
This should solve the problem. If you have trouble with that, please add a comment with any additional info that you might have after trying this.
Cheers!
You can set up a CloudWatch alarm to reboot the unhealthy instance using the StatusCheckFailed_Instance metric.
For detailed information on each step, go through the Adding Reboot Actions to Amazon CloudWatch Alarms section in the following AWS Documentation.
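A minimal sketch of such an alarm via the AWS CLI, assuming a placeholder instance ID; the reboot action ARN must name the same region as the alarm:

```shell
# Reboot the instance when its status check fails for 2 consecutive
# one-minute periods. Instance ID and region are placeholders.
aws cloudwatch put-metric-alarm \
    --alarm-name reboot-on-status-check-fail \
    --namespace AWS/EC2 \
    --metric-name StatusCheckFailed_Instance \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --statistic Maximum \
    --period 60 \
    --evaluation-periods 2 \
    --threshold 1 \
    --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions arn:aws:automate:us-east-1:ec2:reboot
```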
If you want Auto Scaling to replace instances whose application has stopped responding, you can use a configuration file to configure the Auto Scaling group to use Elastic Load Balancing health checks. The following example sets the group to use the load balancer's health checks, in addition to the Amazon EC2 status check, to determine an instance's health.
Example .ebextensions/autoscaling.config
Resources:
  AWSEBAutoScalingGroup:
    Type: "AWS::AutoScaling::AutoScalingGroup"
    Properties:
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
See: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/environmentconfig-autoscaling-healthchecktype.html

Will setting health check to ELB instead of EC2 ignore EC2 metrics like CPU Utilization?

If the autoscaling group's health check type is set to ELB then it will automatically remove any instances that fail the ELB health checks (set via the health check URL).
As long as the configured health check properly reports that an instance is bad (which sounds like the case, since ELB is marking the instance as unhealthy) this should work. But does this mean other autoscaling triggers like CPU Utilization (set in Configuration -> Scaling -> Scaling Trigger) will be ignored?
No. The Auto Scaling group's health checks and the ELB's health checks are independent of each other.
The ELB checks the health status of registered EC2 instances by continuously probing a specific port and page (for example, port 80 and index.html) at a fixed interval, say every 30 or 60 seconds.
If any registered instance is unhealthy, the ELB simply stops sending traffic to it; it does not terminate or stop the instance. The ELB keeps checking the health status of every instance registered with it.
If an unhealthy instance becomes healthy again, the ELB resumes sending traffic to it.
The Auto Scaling group health checks its EC2 instances much like the ELB does, but if an instance goes into the stopped state, the group terminates it and launches a new instance with the same configuration.
If the Auto Scaling group is integrated with an ELB, newly launched instances in the group are automatically registered with the ELB.
The ELB itself will not terminate instances based on health checks. You can, however, monitor the ELB using CloudWatch and enable its access logging feature, providing a target S3 bucket to store the logs.
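Assuming an ALB, access logging can be switched on with the CLI; the load balancer ARN and bucket name below are placeholders, and the bucket policy must grant the ELB log-delivery service write access:

```shell
# Enable ALB access logs delivered to an S3 bucket.
aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/0123456789abcdef \
    --attributes Key=access_logs.s3.enabled,Value=true \
                 Key=access_logs.s3.bucket,Value=my-elb-logs-bucket
```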

How do unhealthy instances get onto my AWS load balancer?

We are using CodeDeploy to load code onto our instances as they boot up. Our intention was that they would not be added to the LB before the code was loaded. To do this, we set a health check that looks for one of the deployed files. What we have found is that sometimes instances without code are created (I assume CodeDeploy failed), and these instances stay in the LB even when marked unhealthy. How is this possible? Is this related to the grace period? Shouldn't unhealthy instances be removed automatically?
I believe I have found a large part of my problem: my Auto Scaling group was set to use EC2 health checks and not my ELB health check. This resulted in the instance not being terminated. Traffic may have continued to flow to this crippled instance for longer because a very long stretch in the unhealthy state was needed before traffic was completely stopped.