How to debug EC2 instances where custom health checks fail - amazon-web-services

I have an auto-scaling group with EC2 that implement a custom health.
From time to time, the health check fails and instances are terminated and replaced.
The health check itself is implemented as a shell script that runs on the instances. If it detects problems, it will inform the auto scaling group via the AWS API:
aws autoscaling set-instance-health --instance-id $instance --health-status Unhealthy
The problem is only that I have no information about what check failed, beside the notification:
Cause: At 2017-06-13T09:11:47Z an instance was taken out of service in response to a user health-check
What is the recommended way to debug these type of problems. Is there a way to make AWS only stop instances and not terminate them, so their disk state could be inspected?
(First I thought about "enable termination protection", but from my understanding this will not make a difference, here. Autoscaling group will still terminate the instances when the shutdown was requested by a failing custom health check.)

Using the set-instance-health command tells Auto Scaling that the instance is unhealthy and needs to be replaced. Auto Scaling will then terminate the unhealthy instance and launch a new instance to replace it.
If you wish to perform forensic analysis on an unhealthy instance, remove it from the Auto Scaling group with the aws autoscaling detach-instances command:
Removes one or more instances from the specified Auto Scaling group. After the instances are detached, you can manage them independent of the Auto Scaling group.
If you do not specify the option to decrement the desired capacity, Auto Scaling launches instances to replace the ones that are detached.
If there is a Classic Load Balancer attached to the Auto Scaling group, the instances are deregistered from the load balancer. If there are target groups attached to the Auto Scaling group, the instances are deregistered from the target groups.
So, instead of calling set-instance-health, call detach-instances (and optionally replace it). You can then debug the instance. If you wish to send it back into service, use aws autoscaling attach-instances.

Related

Why do I need to enable instance scale-in protection in my Auto Scaling Group when enabling a capacity provider within a cluster?

I am making an AWS ECS cluster using EC2 and trying to use capacity providers. I don't really understand why do I need to enable instance scale-in protection inside my AWS Auto Scaling group.
Isn't the point of Auto Scaling the termination of needless EC2 instances?
why do I need to enable instance scale-in protection
This is only needed when you chose to use managed scaling:
When managed scaling is enabled, Amazon ECS manages the scale-in and scale-out actions of the Auto Scaling group used when creating the capacity provider. On your behalf, Amazon ECS creates an AWS Auto Scaling scaling plan with a target tracking scaling policy based on the target capacity value you specify.
The managed scaling ensures that ECS controlled when instances are removed. By doing this it protects any instances that have some tasks running on it from being terminated:
When managed termination protection is enabled, Amazon ECS prevents Amazon EC2 instances that contain tasks and that are in an Auto Scaling group from being terminated during a scale-in action.
The entire idea is that you enable instance scale-in protection on your ASG so that ECS has control over which instances to terminate based on tasks they run. Without this, your ASG could terminate instances based on other criteria, not nervelessly related to "needless EC2 instances". For example, ASG can choose to terminate instances based due to AZRebalance process. This could lead to ASG terminating instances with running tasks, which may not be what you want.

Does AWS Autoscaling marks "stopped" instance as unhealty and spins a new instance?

I am working on building a quick start and trying to understand below statement from https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Stop_Start.html
"If your instance is in an Auto Scaling group, the Amazon EC2 Auto Scaling service marks the stopped instance as unhealthy, and may terminate it and launch a replacement instance. For more information, see Health Checks for Auto Scaling Instances in the Amazon EC2 Auto Scaling User Guide."
Can someone help cases that may lead to termination or is it always a termination?

Can we stop EC2 instance from Auto Scaling Group from AWS

I have an Auto Scaling Group and I want to stop that instance from Auto Scaling Group rather than terminating, Is it possible to do so?
No. From the official definition:
Auto Scaling is a web service designed to launch or terminate Amazon EC2 instances automatically based on user-defined policies, schedules, and health checks.
When scaling-out, new instances are launched into the Auto Scaling group.
When scaling-in, instances are terminated.
Auto Scaling does not start/start instances.
Some benefits of this approach are:
Instances can be launched in different Availability Zones in case there is a failure in a particular AZ
Failed instances can be easily replaced
There is no limit to the number of instances that could be launched (compared to running out of 'stopped' instances)
Launch Configurations can be updated, so any newly-launched instances will use the new configuration (as opposed to recycling old instances)

AWS Beanstalk, how to reboot (or terminate) automatically an instance that is not responding

I have my Beanstalk environment with a "Scaling Trigger" using "CPUUtilization" and it works well.
The problem is that I can not combine this with a system that automatically reboots (or terminate) instances that have been considered "OutOfService" for a certain amount of time.
Into the "Scaling > Scaling Trigger > Trigger measurement" there is the option of "UnHealthyHostCount". But this won't solve my problem optimally, because it will create new instances as far there is one unhealthy, this will provoque my environment to grow until the limit without a real reason. Also, I can not combine 2 "Trigger measurements" and I need the CPU one.
The problem becomes crucial when there is only one instance in the environment, and it becomes OutOfService. The whole environment dies, the Trigger measurement is never triggered.
If you use Classic Load Balancer in your Elastic Beanstalk.
You can go to EC2 -> Auto Scaling Groups.
Then change the Health Check Type of the load balancer from EC2 to ELB.
By doing this, your instances of the Elastic Beanstalk will be terminated once they are not responding. A new instance will be created to replace the terminated instance.
AWS Elastic Beanstalk uses AWS Auto Scaling to manage the creation and termination of instances, including the replacement of unhealthy instances.
AWS Auto Scaling can integrate with the ELB (load balancer), also automatically created by Elastic Beanstalk, for health checks. ELB has a health check functionality. If the ELB detects that an instance is unhealthy, and if Auto Scaling has been configured to rely on ELB health checks (instead of the default EC2-based health checks), then Auto Scaling automatically replaces that instance that was deemed unhealthy by ELB.
So all you have to do is configure the ELB health check properly (you seem to have it correctly configured already, since you mentioned that you can see the instance being marked as OutOfService), and you also have to configure the Auto Scaling Group to use the ELB health check.
For more details on this subject, including the specific steps to configure all this, check these 2 links from the official documentation:
http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.healthstatus.html#using-features.healthstatus.understanding
http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/environmentconfig-autoscaling-healthchecktype.html
This should solve the problem. If you have trouble with that, please add a comment with any additional info that you might have after trying this.
Cheers!
You can setup a CloudWatch alarm to reboot the unhealthy instance using StatusCheckFailed_Instance metric.
For detailed information on each step, go through the Adding Reboot Actions to Amazon CloudWatch Alarms section in the following AWS Documentation.
If you want Auto Scaling to replace instances whose application has stopped responding, you can use a configuration file to configure the Auto Scaling group to use Elastic Load Balancing health checks. The following example sets the group to use the load balancer's health checks, in addition to the Amazon EC2 status check, to determine an instance's health.
Example .ebextensions/autoscaling.config
Resources:
AWSEBAutoScalingGroup:
Type: "AWS::AutoScaling::AutoScalingGroup"
Properties:
HealthCheckType: ELB
HealthCheckGracePeriod: 300
See: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/environmentconfig-autoscaling-healthchecktype.html

Stop traffic to unhealthy instances without replacing them by auto scaling

Using ELB for TCP protocol and AWS Auto Scaling I run into the following problem when scaling out.
three EC2 instances each with 2,000 connections
scaling out because that is my specified threshold
a new instance gets added by Auto Scaling
How can I stop now traffic going to the three EC2 instances which have too many connections?
Removing it from ELB will mean that it will get terminated after a maximum of 1h using connection draining. Bad: TCP connections will get closed.
Marking the EC2 instance as unhealthy using CloudWatch. Bad: Auto Scaling will detect and replace unhealthy instances
Detaching EC2 instance from Auto Scaling group manually via AWS CLI.
Bad: Detaching it from Auto Scaling will also remove it from ELB, see 1.
The only possible solution I can see here and I am not sure if it is feasible:
Using CloudWatch mark the EC2 instance as unhealthy. ELB will stop distributing traffic to it. At the same time update the EC2 health for Auto Scaling manually:
aws autoscaling set-instance-health --instance-id i-123abc45d –-health-status healthy
This should override the health in a way that ELB will continue to ignore the EC2 instance and AWS Auto Scaling will not try to replace the instance. Would that work or is there a better solution?