ec2 instance terminated without completing lifecycle action - amazon-web-services

I am starting up a bunch of EC2 instances in an Auto Scaling group with autoscaling:EC2_INSTANCE_LAUNCHING and autoscaling:EC2_INSTANCE_TERMINATING lifecycle hooks. When I initiate an instance termination from the AWS Management Console, the instance gets terminated without waiting for me to complete the lifecycle action (https://docs.aws.amazon.com/cli/latest/reference/autoscaling/complete-lifecycle-action.html).
The instance state in the Auto Scaling Groups UI shows up as Terminating:Wait. The instance state in the EC2 Instances UI shows up as Terminated. This is preventing me from taking any corrective actions before completing the lifecycle action and actually terminating the instance.
The same does not seem to happen when I reduce the desired capacity of the Auto Scaling group. The instances appear to go through the proper lifecycle stages when I reduce the desired capacity, which in turn causes instance terminations.
Is this how AWS ASG lifecycle hooks are intended to work? If so, they are pretty much useless for any ASG instance terminations triggered by anything other than changing the desired capacity of the ASG.

Yes, the lifecycle hooks are only invoked when Auto Scaling itself performs a scale-in/scale-out operation.
The fact that you are directly terminating the instance bypasses Auto Scaling, so it does not have an opportunity to activate the termination hooks. All it sees is that an instance is no longer healthy.
If you wish to terminate a specific instance in an Auto Scaling group, use terminate-instance-in-auto-scaling-group. That tells Auto Scaling to terminate the instance and the hooks will be used.
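A rough CLI sketch of that flow (the hook name, group name, and instance ID below are placeholders):

# Ask Auto Scaling (not EC2) to terminate the instance so the
# EC2_INSTANCE_TERMINATING hook fires and the instance waits in Terminating:Wait.
aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id i-0123456789abcdef0 \
    --no-should-decrement-desired-capacity

# After your corrective actions are done, let the termination proceed.
aws autoscaling complete-lifecycle-action \
    --lifecycle-hook-name my-termination-hook \
    --auto-scaling-group-name my-asg \
    --lifecycle-action-result CONTINUE \
    --instance-id i-0123456789abcdef0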

Related

When does AWS deregister an EC2 from the auto-scaling group during scale-in?

I am trying to figure out when and how AWS deregisters an EC2 from an auto-scaling group during scale-in. I am especially worried about the case where an EC2 instance that is about to be terminated receives an incoming request shortly before being terminated. This would naturally cause processing of the request to fail. The desired behavior would be for AWS to deregister the about-to-be-terminated instance from the group some configurable time before actually terminating it. I have found no documentation about this specific issue. Am I missing something?
There's no configuration that guarantees that a specific time elapses between deregistering an instance from the group and terminating it.
You can use Elastic Load Balancer health checks to remove instances that are not responding from the load balancer endpoint before they are terminated.
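For example, a sketch of switching the group to ELB health checks with the CLI (the group name is a placeholder):

# Mark instances unhealthy (and replace them) based on the load balancer's health check.
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name my-asg \
    --health-check-type ELB \
    --health-check-grace-period 300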
From AWS Documentation https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-termination-policies.html
Default termination policy
The default termination policy applies multiple termination criteria before selecting an instance to terminate. When Amazon EC2 Auto Scaling terminates instances, it first determines which Availability Zones have the most instances, and it finds at least one instance that is not protected from scale in. Within the selected Availability Zone, the following default termination policy behaviour applies:
Determine whether any of the instances eligible for termination use the oldest launch template or launch configuration:
[For Auto Scaling groups that use a launch template]
Determine whether any of the instances use the oldest launch template, unless there are instances that use a launch configuration. Amazon EC2 Auto Scaling terminates instances that use a launch configuration before it terminates instances that use a launch template.
[For Auto Scaling groups that use a launch configuration]
Determine whether any of the instances use the oldest launch configuration.
After applying the preceding criteria, if there are multiple unprotected instances to terminate, determine which instances are closest to the next billing hour. If there are multiple unprotected instances closest to the next billing hour, terminate one of these instances at random.

How can I control which EC2 instances get removed by an AutoScalingGroup using Amazon Web Services?

I have foreseen a problem that could happen with my application but I am unsure if it is possible to solve, and perhaps the architecture needs to be redesigned.
I am using an AutoScalingGroup (ASG) on AWS to create EC2 instances that host game servers that players can join. At the moment, the ASG is scaled manually via a matchmaking API which changes the desired capacity based on its needs. The problem occurs when a game server is finished.
When a game finishes, it signals to the matchmaker that it is finished and needs terminating, and the matchmaker then scales down the ASG accordingly. However, the ASG doesn't seem to know exactly which instance to remove, and it won't necessarily be the one that needs terminating.
I can terminate the instance myself, but because the ASG's desired capacity is never changed when the instance is terminated, another server is created to replace it.
Is there a way I can scale down the ASG, as well as specifying which servers to remove from the group?
In a nutshell, the default termination policy during scale in is designed to remove instances that use the oldest launch configuration.
Currently, Amazon EC2 Auto Scaling supports the following termination policies:
OldestInstance Terminate the oldest instance in the group. This option is useful when you're upgrading the instances in the Auto Scaling group to a new EC2 instance type. You can gradually replace instances of the old type with instances of the new type.
NewestInstance Terminate the newest instance in the group. This policy is useful when you're testing a new launch configuration but don't want to keep it in production.
OldestLaunchConfiguration Terminate instances that have the oldest launch configuration. This policy is useful when you're updating a group and phasing out the instances from a previous configuration.
ClosestToNextInstanceHour Terminate instances that are closest to the next billing hour. This policy helps you maximize the use of your instances and manage your Amazon EC2 usage costs.
Default Terminate instances according to the default termination policy. This policy is useful when you have more than one scaling policy for the group.
Instance protection
One possible solution is to use instance protection. Auto Scaling provides instance protection to control whether an instance can be terminated when scaling in.
Therefore, enable instance protection on the ASG so that instances are protected from scale-in by default. Once you are done with your server, decrease the desired number of instances and remove instance protection from that particular instance (using either the CLI or an SDK; the protection remains enabled for the rest of the instances), and Auto Scaling will terminate exactly that instance.
For more information about instance protection, see Instance Protection
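A rough CLI sketch of that flow (the group name and instance ID are placeholders):

# Protect all newly launched instances from scale-in by default.
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name game-servers \
    --new-instances-protected-from-scale-in

# When a game server finishes: unprotect that instance and shrink the group,
# so Auto Scaling can only pick the unprotected instance.
aws autoscaling set-instance-protection \
    --auto-scaling-group-name game-servers \
    --instance-ids i-0123456789abcdef0 \
    --no-protected-from-scale-in
aws autoscaling set-desired-capacity \
    --auto-scaling-group-name game-servers \
    --desired-capacity 4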
The oldest server is removed. If you want to scale down a specific server, you will have to kill that server before changing desired capacity.

AWS Autoscaling updating

You can create new Launch Configuration (updating AMI or whatever) and attach this with an existing Autoscaling Group. Per AWS Docs: After you change the launch configuration for an Auto Scaling group, any new instances are launched using the new configuration options, but existing instances are not affected.
How do you force this? Meaning, relaunch all instances now (with the new AMI). Do I have to delete the existing Auto Scaling Group and create a new one (with the new config)? Or do I simply delete existing instances (one by one, manually) and then the ASG relaunches them with the new AMI? Any best practices/gotchas?
CloudFormation has the RollingUpdate flag (not sure of this outside of CF)
Thanks
AWS has some OOTB solutions for this: CloudFormation (like you say), Elastic Beanstalk (built on top of CF), and CodeDeploy blue-green deployments (I've not tried this).
Personally, for our SQS-polling ASG, we do a "cold deploy", i.e. only "deploy" when there are no messages to process (and hence, due to a scaling policy, no instances). It's been really effective.
A deploy can be done safely whilst there are messages, provided that you set scale-in-protection on the instance during message processing (and remove it and wait briefly before polling):
set desired-capacity to 0
wait a bit (for there to be no instances running)
set desired-capacity back to N.
Note: you can do this all in the console.
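For reference, the CLI equivalent of those steps is roughly (the group name and capacity are placeholders):

# Scale to zero so the next scale-out launches instances from the new launch configuration.
aws autoscaling set-desired-capacity \
    --auto-scaling-group-name my-asg \
    --desired-capacity 0

# ...wait until the old instances are gone, then scale back up.
aws autoscaling set-desired-capacity \
    --auto-scaling-group-name my-asg \
    --desired-capacity 3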
You can code a solution yourself that does this... but I probably wouldn't.
Be careful:
simply delete existing instances (one by one, manually)
Whether you can do this depends on whether the instances are still handling requests/processing (usually you can't simply terminate an instance without dropping service).
I recommend Elastic Beanstalk, which gives you a rolling update feature for free and is very easy to get started with. I've not tried the CodeDeploy blue-green deployments but they look interesting. If you want more advanced behavior (or are already using it), look into CloudFormation... do not code your own solution for rolling deployments: just use CloudFormation.
If your issue is with "in flight" requests, simply enable connection draining or increase the deregistration delay of the ELB or target groups attached to the ASG. You can set a value of up to one hour.
When you enable connection draining, you can specify a maximum time for the load balancer to keep connections alive before reporting the instance as de-registered. The maximum timeout value can be set between 1 and 3,600 seconds (the default is 300 seconds). When the maximum time limit is reached, the load balancer forcibly closes connections to the de-registering instance.
Then you can detach the old instances.
If you detach an instance from an Auto Scaling group that has an attached load balancer, the instance is deregistered from the load balancer. If you detach an instance from an Auto Scaling group that has an attached target group, the instance is deregistered from the target group. If connection draining is enabled for your load balancer, Auto Scaling waits for in-flight requests to complete.
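A sketch of both steps with the CLI (the target group ARN, group name, and instance ID are placeholders; for a classic ELB you would enable connection draining on the load balancer instead):

# Increase the deregistration delay (draining time) on the target group.
aws elbv2 modify-target-group-attributes \
    --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef \
    --attributes Key=deregistration_delay.timeout_seconds,Value=300

# Detach an old instance; it is deregistered from the target group first.
aws autoscaling detach-instances \
    --auto-scaling-group-name my-asg \
    --instance-ids i-0123456789abcdef0 \
    --should-decrement-desired-capacity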
If you don't want to do any manual scaling, I guess the best approach is to change the termination policy to OldestInstance and leave the ASG as it is. When a scale-in activity happens, the ASG will automatically terminate the old instances (in your case, the instances from the old launch configuration).
OldestInstance. Auto Scaling terminates the oldest instance in the group. This option is useful when you're upgrading the instances in the Auto Scaling group to a new EC2 instance type. You can gradually replace instances of the old type with instances of the new type.
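For example (the group name is a placeholder):

# Prefer terminating the oldest instances (i.e. those from the old launch configuration) on scale-in.
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name my-asg \
    --termination-policies OldestInstance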

Does reducing desired instances in (AWS::AutoScaling::AutoScalingGroup) terminates the instance without stopping it

I have created a custom AMI wherein a service xxx is started when the instance starts and stopped when the instance stops.
I have wrapped that ami under CloudFormation's AWS::AutoScaling::AutoScalingGroup.
When a new instance is added by the Auto Scaling group, the xxx service gets started on it. But when I reduce the number of desired instances, the xxx service does not get stopped.
So does reducing the desired number of instances in the Auto Scaling group simply terminate the instance without stopping it?
How do I get the notification to stop the xxx service before instance termination?
Yes, when Auto Scaling scales down/in, it terminates instances (it does not simply stop them).
To react to such an event, you can use Lifecycle Hooks which allow you to perform custom actions as Auto Scaling launches or terminates instances.
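A rough sketch of adding such a hook with the CLI (the hook name and group name are placeholders); on termination the instance then has up to the heartbeat timeout to stop the xxx service before calling complete-lifecycle-action:

# Pause terminating instances in Terminating:Wait for up to 5 minutes.
aws autoscaling put-lifecycle-hook \
    --lifecycle-hook-name stop-xxx-hook \
    --auto-scaling-group-name my-asg \
    --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
    --heartbeat-timeout 300 \
    --default-result CONTINUE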

How can I prevent EC2 instance termination by Auto Scaling?

I would like to prevent EC2 instance termination by Auto Scaling feature if that instance is in the middle of some sort of processing.
Background:
Suppose I have an Auto Scaling group that currently has 5 instances running.
I create an alarm on average CPU usage...
Suppose 4 of the instances are idle and one is doing some heavy processing...
The average CPU load will trigger the alarm and as a result the scale-down policy will execute.
How do I get Auto Scaling to terminate one of the idle instances and not the one that is in the middle of the processing?
Update
As noted by Ryan Walls (+1), AWS meanwhile provides Instance Protection to control whether Auto Scaling can terminate a particular instance when scaling in (see the introductory blog post Instance Protection for Auto Scaling for a walk through):
You can enable the instance protection setting on an Auto Scaling group or an individual Auto Scaling instance. When Auto Scaling launches an instance, the instance inherits the instance protection setting of the Auto Scaling group. [...]
It's worth noting that this instance protection only applies to regular Auto Scaling scale in events:
Instance protection does not protect Auto Scaling instances from manual termination through the Amazon EC2 console, the terminate-instances command, or the TerminateInstances API. Instance protection does not protect an Auto Scaling instance from termination if it fails health checks and must be replaced. Also, instance protection does not protect Spot instances in an Auto Scaling group from interruption.
As usual, the feature is available via the AWS Management Console (menu Actions -> Instance Protection -> Set Scale In Protection), the AWS CLI (set-instance-protection command), and the API (SetInstanceProtection API action).
The latter two options allow automation of the scenario at hand, i.e. one would need to enable instance protection before running 'heavy processing' jobs, and disable instance protection once they are finished so that the instance is eligible for termination again.
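A minimal sketch of that automation with the CLI, assuming the job runs on the instance itself (the group name and job command are placeholders; the instance ID can be read from instance metadata):

# Protect this instance before starting the heavy processing job...
aws autoscaling set-instance-protection \
    --auto-scaling-group-name my-asg \
    --instance-ids "$INSTANCE_ID" \
    --protected-from-scale-in

run-heavy-job   # placeholder for the actual work

# ...and make it eligible for scale-in again once the job is done.
aws autoscaling set-instance-protection \
    --auto-scaling-group-name my-asg \
    --instance-ids "$INSTANCE_ID" \
    --no-protected-from-scale-in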
Initial Answer
This functionality is currently not available for Auto Scaling of Amazon EC2 instances - while you are indeed able to Configure [an] Instance Termination Policy for Your Auto Scaling Group, the available policies do not include such a (fairly advanced) concept:
Auto Scaling provides the following termination policy options for you to choose from. You can specify one or more of these options in your termination policy.
OldestInstance — Specify this if you want the oldest instance in your Auto Scaling group to be terminated. [...]
NewestInstance — Specify this if you want the last launched instance to be terminated. [...]
OldestLaunchConfiguration — Specify this if you want the instance launched using the oldest launch configuration to be terminated. [...]
ClosestToNextInstanceHour — Specify this if you want the instance that is closest to completing the billing hour to be terminated. [...]
Default — Specify this if you want Auto Scaling to use the default termination policy to select instances for termination.
I just successfully dealt with the problem of long-running jobs in an auto scaling group using the relatively recent lifecycle hook feature.
The problem with trying to choose an idle node to terminate, in my case, was that the process that chooses the idle node will race against processes that submit work to the nodes. In this case it's better to use a strategy where any node can be terminated, but termination happens gracefully so that no work is lost. You can then use all of the standard auto scaling policy stuff to manage scale-in and scale-out.
The termination lifecycle hook allows the user (or a process) to perform actions on the node after it has been placed into an intermediate state (labeled Terminating:Wait) by the auto scaling group. The user (or process) is then responsible for completing the lifecycle action via an AWS API call, resulting in the shutdown of the terminated EC2 instance.
The way I set this up, in short, is (a rough sketch of the instance-side CLI calls follows the list):
Create a role that allows auto scaling to post a message to an SQS queue.
Create an SQS queue for the termination messages.
Create a monitor script that runs as a service in each node. My script is a simple event-driven state machine that transitions in sequence from MONITORING (polling SQS for a termination message for the node) to DRAINING (polling a job queue until no work is being performed on the node) to TERMINATED (making the complete-lifecycle call).
Standard configuration for event-driven AWS auto-scaling; that is, creating CloudWatch alarms, and the auto-scaling policies for scale-in and scale-out.
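A very rough sketch of the instance-side calls (the queue URL, hook name, and group name are placeholders; the real monitor script parses the SQS message and checks its own job queue before moving between states):

# MONITORING: poll SQS for a termination notice aimed at this instance.
aws sqs receive-message \
    --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/asg-termination \
    --wait-time-seconds 20

# DRAINING: if draining outlasts the hook's timeout, keep the hook alive.
aws autoscaling record-lifecycle-action-heartbeat \
    --lifecycle-hook-name drain-hook \
    --auto-scaling-group-name my-asg \
    --instance-id "$INSTANCE_ID"

# TERMINATED: no work left, let Auto Scaling shut the instance down.
aws autoscaling complete-lifecycle-action \
    --lifecycle-hook-name drain-hook \
    --auto-scaling-group-name my-asg \
    --lifecycle-action-result CONTINUE \
    --instance-id "$INSTANCE_ID"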
One hindrance to this approach is that lifecycle hook management isn't supported yet in the SDKs (boto, at least, doesn't support it AFAIK), nor are there CloudFormation resources for the hooks.
The relevant AWS documentation is here:
http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/AutoScalingGroupLifecycle.html
Amazon has finally addressed this issue in a simpler way. There is now "instance protection" where you can mark your instance as protected and it will not be terminated during a "scale in".
See https://aws.amazon.com/blogs/aws/new-instance-protection-for-auto-scaling
aws-cli is your best friend.
Disable your scale down policy on your autoscaling group.
Create a cron job or scheduled task using aws-cli (a rough sketch follows the list) to:
2a. Get the EC2 instances associated with the autoscaling group
http://docs.aws.amazon.com/cli/latest/reference/autoscaling/describe-auto-scaling-instances.html
2b. Next monitor the cloudwatch statistics on the EC2 instances
http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/US_SingleMetricPerInstance.html
http://docs.aws.amazon.com/cli/latest/reference/cloudwatch/get-metric-statistics.html
2c. Terminate the idle EC2 instance(s) from your auto-scaling group
http://docs.aws.amazon.com/cli/latest/reference/autoscaling/terminate-instance-in-auto-scaling-group.html
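A rough sketch of those three steps (the group name, instance ID, and timestamps are placeholders; a real job would loop over every instance and apply its own idle threshold):

# 2a. List the instances currently in the group.
aws autoscaling describe-auto-scaling-instances \
    --query "AutoScalingInstances[?AutoScalingGroupName=='my-asg'].InstanceId" \
    --output text

# 2b. Check the recent average CPU for one instance.
aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
    --start-time 2015-01-01T00:00:00Z --end-time 2015-01-01T01:00:00Z \
    --period 300 --statistics Average

# 2c. If it is idle, terminate it and shrink the group.
aws autoscaling terminate-instance-in-auto-scaling-group \
    --instance-id i-0123456789abcdef0 \
    --should-decrement-desired-capacity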
You can use Amazon CloudWatch to achieve this:
http://aws.typepad.com/aws/2013/01/amazon-cloudwatch-alarm-actions.html. From the article:
You can use a similar strategy to get rid of instances that are tasked with handling compute-intensive batch processes. Once the CPU goes idle and the work is done, terminate the instance and save some money!
In this case, since you will be handling the termination, you will need to remove the scale-down policy. Also see another option: https://stackoverflow.com/a/19628453/432849.