AWS AutoScaling 'oldestinstance' Termination Policy does not always terminate oldest instances

Scenario
I am creating a script that will launch new instances into an AutoScaling Group and then remove the old instances. The purpose is to introduce newly created (or updated) AMIs to the AutoScaling Group. This is accomplished by doubling the Desired capacity. Then, after the new instances are Running, the Desired capacity is reduced by the number of instances that were added.
Problem
When I run the script, I watch the group capacity increase by double, the new instances come online, they reach the Running state, and then the group capacity is decreased. Works like a charm. The problem is that SOMETIMES the instances that are terminated by the decrease are actually the new ones instead of the older ones.
Question
How can I ensure that the AutoScaling Group will always terminate the Oldest Instance?
Settings
The AutoScaling Group has the following Termination Policies: OldestInstance, OldestLaunchConfiguration. The Default policy has been removed.
The Default Cooldown is set to 0 seconds.
The Group only has one Availability Zone.
Troubleshooting
I played around with the Cooldown setting. Ended up just putting it on 0.
I waited different lengths of time to see if the existing servers needed to be running for a certain amount of time before they would be terminated. It seems that if they are less than 5 minutes old, they are less likely to be terminated, but not always. I had servers that were 20 minutes old that were left running while the new ones were terminated. Perhaps newly launched instances have some termination protection grace period?
Concession
I know that in most cases, the servers I will be replacing will have been running for a long time. In production, this might not be an issue. Still, it is possible that during the normal course of AutoScaling, an older server will be left running instead of a newer one. This is not an acceptable way to operate.
I could force specific instances to terminate, but that would defeat the point of the OldestInstance Termination Policy.
Update: 12 Feb 2014
I have continued to see this in production. Instances with older launch configs that have been running for weeks will be left running while newer instances will be terminated. At this point I am considering this to be a bug. A thread at Amazon was opened for this topic a couple years ago, apparently without resolution.
Update: 21 Feb 2014
I have been working with AWS support staff and at this point they have preliminarily confirmed it could be a bug. They are researching the problem.

It doesn't look like you can, precisely, because auto-scaling is trying to do one other thing for you in addition to having the correct number of instances running: keep your instance counts balanced across availability zones... and it prioritizes this consideration higher than your termination policy.
Before Auto Scaling selects an instance to terminate, it first identifies the Availability Zone that has more instances than the other Availability Zones used by the group. If all Availability Zones have the same number of instances, it identifies a random Availability Zone. Within the identified Availability Zone, Auto Scaling uses the termination policy to select the instance for termination.
— http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/us-termination-policy.html
If you're out of balance, then staying in balance is arguably the most sensible strategy, especially if you are using ELB. The documentation is a little ambiguous, but ELB will advertise one public IP in the DNS for each availability zone where it is configured; these three IP addresses will achieve the first tier of load balancing by virtue of round-robin DNS. If all of the availability zones where the ELB is enabled have healthy instances, then there appears to be a 1:1 correlation between which external IP the traffic hits and which availability zone's servers that traffic will be offered to by ELB -- at least that is what my server logs show. It appears that ELB doesn't route traffic across availability zones to alternate servers unless all of the servers in a given zone are detected as unhealthy, and that may be one of the justifications of why they've implemented autoscaling this way.
Although this algorithm might not always kill the oldest instance first on a region-wide basis, if it does operate as documented, it would kill off the oldest one in the selected availability zone, and at some point it should end up cycling through all of them over the course of several shifts in load... so it would not leave the oldest running indefinitely, either. The larger the number of instances in the group is, it seems like the less significant this effect should be.
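To make the documented selection order concrete, here is a minimal sketch in Python. It is a simulation of the two-step rule quoted above, not AWS code: the instance records and the `pick_instance_to_terminate` name are made up for illustration, and where the documentation says a random zone is chosen on a tie, the sketch takes the alphabetically first zone so the result is repeatable.

```python
from collections import Counter
from datetime import datetime


def pick_instance_to_terminate(instances):
    """Simulate the documented order: busiest AZ first, then OldestInstance.

    `instances` is a list of dicts with hypothetical keys 'id', 'az'
    and 'launch_time'.
    """
    counts = Counter(inst["az"] for inst in instances)
    largest = max(counts.values())
    # The docs say a random AZ is picked on a tie; we take the first
    # alphabetically so the example is deterministic.
    chosen_az = sorted(az for az, n in counts.items() if n == largest)[0]
    in_az = [inst for inst in instances if inst["az"] == chosen_az]
    # OldestInstance is applied only inside the chosen AZ.
    return min(in_az, key=lambda inst: inst["launch_time"])


fleet = [
    {"id": "i-old", "az": "us-east-1a", "launch_time": datetime(2014, 1, 1)},
    {"id": "i-mid", "az": "us-east-1b", "launch_time": datetime(2014, 2, 1)},
    {"id": "i-new", "az": "us-east-1b", "launch_time": datetime(2014, 2, 10)},
]
victim = pick_instance_to_terminate(fleet)
print(victim["id"])  # i-mid
```

In this made-up fleet the oldest instance overall (i-old) survives because its zone is not the one with the most instances.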

There are a couple of other ways to do it:
Increase desired to 2x
Wait for action to increase capacity
When the new instances are running, suspend all AS activity (as-suspend-processes MyAutoScalingGroup)
Reset desired
Terminate old instances
Resume AS activity.
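The steps above could be sketched in Python roughly as follows. This is an illustration under assumptions, not a tested deployment script: `autoscaling` and `ec2` are expected to behave like boto3's Auto Scaling and EC2 clients, `rotate_group` and `in_service_ids` are names invented here, and the group's maximum size is assumed to allow twice the current capacity.

```python
import time


def in_service_ids(autoscaling, group_name):
    """Return the IDs of the group's instances that are InService."""
    resp = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    group = resp["AutoScalingGroups"][0]
    return [
        inst["InstanceId"]
        for inst in group["Instances"]
        if inst["LifecycleState"] == "InService"
    ]


def rotate_group(autoscaling, ec2, group_name, poll_seconds=15, sleep=time.sleep):
    """Double the group, wait, then remove the original instances."""
    old_ids = in_service_ids(autoscaling, group_name)
    if not old_ids:
        return []
    original = len(old_ids)

    # Increase desired to 2x and wait for the new instances.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name, DesiredCapacity=2 * original
    )
    while len(in_service_ids(autoscaling, group_name)) < 2 * original:
        sleep(poll_seconds)

    # Suspend all processes, reset desired, terminate the old instances.
    autoscaling.suspend_processes(AutoScalingGroupName=group_name)
    try:
        autoscaling.set_desired_capacity(
            AutoScalingGroupName=group_name, DesiredCapacity=original
        )
        ec2.terminate_instances(InstanceIds=old_ids)
    finally:
        # Always resume, even if one of the calls above fails.
        autoscaling.resume_processes(AutoScalingGroupName=group_name)
    return old_ids
```

With boto3 installed, the clients would presumably come from `boto3.client("autoscaling")` and `boto3.client("ec2")`. The polling loop has no timeout, which a real script would want.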
Or:
Bring up a brand new ASG with the new launch config.
Suspend AS activity until step 1 is finished.
If everything is ok, delete the old ASG.
Resume AS activity
For ultimate rollback deployment:
Create new ELB (might have to ask Amazon to provision more elb if you have a lot of traffic, this is kinda lame and makes it not automation friendly)
Create new ASG with new LC
Switch DNS to new ELB
Delete old ELB/ASG/LC if everything's fine, if not just change DNS back
Or with the new ASG API that lets you attach/detach instances from ASG:
Somehow bring up your new instances (could just be run-instances or create a temp asg)
Suspend AS activity, then use http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/attach-instance-asg.html to attach them to your old ASG
Use http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/detach-instance-asg.html or terminate your old instances
Resume AS activity
The reason you might want to use your old ASG is because it can be a pita to set all the policies again (even when automated) and it feels a bit safer to change as little as possible.

My use case is that we needed to scale down and be able to choose which machines go down. Unfortunately, the "OldestInstance" termination policy was not working for us either. I was able to use a variant of the attach/detach method that ambakshi shared to remove the oldest (or any instance I choose) and at the same time lower the desired instances value of the autoscaling group.
Step 1 – Change the autoscaling group Min value to the number you want to scale down to.
Step 2 – Suspend the ASG
Step 3 – Detach the instances you want to terminate; you can detach multiple instances in one command. Make sure to use the --should-decrement-desired-capacity flag
Step 4 – Resume the ASG
Step 5 – Terminate your instances using the console or the CLI
UPDATE
There is no need to suspend the Auto Scaling Group, just doing steps 1, 3 and 5 worked for me. Just be aware of any availability zone balancing that may happen.
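Steps 1, 3 and 5 might look like this in Python. It is only a sketch: the clients are assumed to behave like boto3's Auto Scaling and EC2 clients, `scale_in_specific` is a name invented for the example, and error handling is left out.

```python
def scale_in_specific(autoscaling, ec2, group_name, instance_ids, new_min_size):
    """Remove chosen instances from a group and shrink it to match."""
    # Step 1: lower the minimum so the desired capacity is allowed to drop.
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=group_name, MinSize=new_min_size
    )
    # Step 3: detach the chosen instances and decrement desired capacity,
    # so that the group does not launch replacements.
    autoscaling.detach_instances(
        InstanceIds=instance_ids,
        AutoScalingGroupName=group_name,
        ShouldDecrementDesiredCapacity=True,
    )
    # Step 5: the detached instances keep running until terminated.
    ec2.terminate_instances(InstanceIds=instance_ids)
```

A call such as `scale_in_specific(asg_client, ec2_client, "MyAutoScalingGroup", ["i-0123"], 2)` would use placeholder values throughout; the group name and instance ID here are not taken from a real account.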

Related

Does an autoscaling group that contains spot instances respond to spot instance interruption notices?

I'm considering using spot instances in an auto-scaling group. While I'm aware that I'll receive a 'Spot instance interruption notice' if my spot instances are going to get terminated, what is unclear from the documentation is if my auto-scaling group will spin up new on-demand instances to replace these when the notice occurs, or if they only get replaced on termination. I'm aware that I could listen for these notices manually, but it seems like something that an auto-scaling group should be able to handle automatically.
I've tried testing this out on an existing auto-scaling group that had spot instances by changing the launch configuration's 'spot price' to be lower than the current price. This did not work as it would only affect new instances and not currently running ones. I'm unsure of how to change an existing spot request's price.
What I'm hoping will happen is that on-demand instances will be spun up in the two minutes I have from the interruption notice till the time of termination.
If the Launch Configuration in your Auto Scaling Group is configured to use Spot instances then the new instance will indeed be a Spot instance.
The situation you describe is one of the challenges of using Spot instances: although the cost is very low, a Spot instance can be terminated at any time so that the underlying resources can be used to fulfill an On-Demand or Reserved Instance for another customer.
One way to avoid this is to use Reserved Instances. If you have a predictable long-term need for an instance, or are running a production workload, using Reserved Instances is an effective way to lower your costs (albeit not as low as a Spot instance) without having to worry that you could lose your instance(s) at any time.
Regarding changing the price, updates to pricing are applied to new instances only. After updating pricing simply terminate your existing instances and they’ll be replaced by your ASG with instances at the new price.

How can I control which EC2 instances get removed by an AutoScalingGroup using Amazon Web Services?

I have foreseen a problem that could happen with my application but I am unsure if it is possible to solve, and perhaps the architecture needs to be redesigned.
I am using an AutoScalingGroup (ASG) on AWS to create EC2 instances that host game servers that players can join. At the moment, the ASG is scaled manually via a matchmaking API which changes the desired capacity based on its needs. The problem occurs when a game server is finished.
When a game finishes, it signals to the matchmaker that it is finished and needs terminating, and the matchmaker will then scale down the ASG accordingly, however, it doesn't seem to know exactly which instance to remove, and it won't necessarily be the one that needs terminating.
I can terminate the instance, but then as the ASG desired capacity is never changed when the instance is terminated, another server is created.
Is there a way I can scale down the ASG, as well as specifying which servers to remove from the group?
In a nutshell, the default termination policy during scale in is designed to remove instances that use the oldest launch configuration.
Currently, Amazon EC2 Auto Scaling supports the following termination policies:
OldestInstance Terminate the oldest instance in the group. This option is useful when you're upgrading the instances in the Auto Scaling group to a new EC2 instance type. You can gradually replace instances of the old type with instances of the new type.
NewestInstance Terminate the newest instance in the group. This policy is useful when you're testing a new launch configuration but don't want to keep it in production.
OldestLaunchConfiguration Terminate instances that have the oldest launch configuration. This policy is useful when you're updating a group and phasing out the instances from a previous configuration.
ClosestToNextInstanceHour Terminate instances that are closest to the next billing hour. This policy helps you maximize the use of your instances and manage your Amazon EC2 usage costs.
Default Terminate instances according to the default termination policy. This policy is useful when you have more than one scaling policy for the group.
Instance protection
One possible solution is to use instance protection. Auto Scaling provides instance protection to control whether an instance can be terminated when scaling in.
Therefore, enable instance protection for the ASG so that instances are protected from scale-in by default. Once you are done with your server, decrease the desired number of instances and remove instance protection from that particular instance (using either the CLI or an SDK; note that protection remains enabled for the rest of the instances), and auto-scaling will terminate that exact instance.
For more information about instance protection, see Instance Protection
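As a rough illustration of the instance-protection approach, the sketch below lowers the desired capacity by one and then unprotects the finished server. It assumes an object that behaves like boto3's Auto Scaling client, a group whose instances are all protected from scale-in, and a minimum size that permits the lower capacity; `retire_instance` is a hypothetical name.

```python
def retire_instance(autoscaling, group_name, instance_id):
    """Shrink the group by one and let it remove `instance_id`."""
    resp = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    new_desired = resp["AutoScalingGroups"][0]["DesiredCapacity"] - 1

    # Ask for one instance fewer. With every instance protected,
    # nothing can be terminated yet.
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=group_name, DesiredCapacity=new_desired
    )
    # Unprotect only the finished server, making it the sole
    # candidate for the pending scale-in.
    autoscaling.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=group_name,
        ProtectedFromScaleIn=False,
    )
    return new_desired
```

In the game-server scenario, the matchmaker would call something like this when a server reports that its game has finished.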
The oldest server is removed. If you want to scale down a specific server, you will have to kill that server before changing desired capacity.

Automatically terminate Auto Scaling instances after a time period

We use Amazon EC2 Auto Scaling groups for a bunch of apps - as everyone knows, while you try your hardest not to have memory leaks and other "been up for a while problems" - it's still possible.
We'd like to protect against such possibilities by just bouncing the servers - ie make sure an instance only lasts, say, 24 hours before killing it. However, we want the killing to be "safe" - eg - even if there's only one instance in the group, we want to start up another instance to a working state, then kill the old box.
Is there any support for this? eg a time-to-live property on an instance?
There is no such property in Amazon EC2 nor in Auto Scaling.
You could manually set the instance health to Unhealthy, which would cause Auto Scaling to terminate and replace the instance. However, if you have only one instance then there will likely be a period where there are no instances.
You could set the Auto Scaling termination policy to OldestInstance, which means that when Auto scaling needs to terminate an instance, it will terminate the oldest instance within the AZ that has the most instances. This gets rid of old instances, but only when the group is scaled-in.
Therefore, you could supplement the Termination Policy with a script that scales-out the group and then causes it to scale-in again. For example, double the number of instances, wait for them to launch, and then halve the number of instances. This should cause them all to refresh (with a few edge conditions if your instances are spread across multiple AZs, causing non-even counts).
Another option is to restart the instance(s). This will not cause them to appear unhealthy to Auto Scaling, but they will appear unhealthy to a Load Balancer. (If you have activated ELB Health Checks within Auto Scaling, then Auto Scaling would actually terminate instances that fail the health check.) You can use Scheduled Events for Your Instances to have Amazon CloudWatch Events restart your instance(s) at certain intervals, or even have a script on the instance tell the Operating System to restart at certain intervals.
However, there is no automatic option to do exactly what you asked.
Since 2019, there has been a Maximum Instance Lifetime parameter, that almost does what you wanted.
Unfortunately, though, it isn’t possible to set the maximum instance lifetime to 24 hours (86400 seconds): the minimum is a week.
Maximum instance lifetime must be equal to 0, between 604800 and 31536000 seconds (inclusive), or not specified.
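The rule in that message can be expressed as a small check. This is a sketch that encodes the limits exactly as quoted in the message above; the current service limits may differ, so the bounds are parameters rather than fixed facts, and the function name is made up for the example.

```python
MIN_LIFETIME_SECONDS = 604_800      # one week, as quoted in the message
MAX_LIFETIME_SECONDS = 31_536_000   # 365 days, as quoted in the message


def is_valid_max_instance_lifetime(
    seconds, minimum=MIN_LIFETIME_SECONDS, maximum=MAX_LIFETIME_SECONDS
):
    """Return True if `seconds` satisfies the quoted rule.

    None means "not specified" and 0 clears the setting; both are allowed.
    """
    if seconds is None or seconds == 0:
        return True
    return minimum <= seconds <= maximum


print(is_valid_max_instance_lifetime(86_400))   # False
print(is_valid_max_instance_lifetime(604_800))  # True
```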

AWS Is it possible to automatically terminate and recreate new instances for an auto scaling group periodically?

We have an AWS scaling group that has 10-20 servers behind a load balancer. After running for a couple of weeks, some of these servers go bad. We have no idea why the servers go bad and it will take some time for us to get to a stage where we can debug this issue.
In the interim is there a way to tell AWS to terminate all the instances in the scaling group in a controlled fashion (one by one) until all the instances are replaced by new ones every week or so?
You can achieve this very effectively using Data Pipeline.
This is the developer guide for How do I stop and start Amazon EC2 Instances at scheduled intervals with AWS Data Pipeline?
There is no function in Auto Scaling to tell it to automatically terminate and replace instances. However, you could script such functionality.
Assumptions:
Terminate instances that are older than a certain number of hours old
Do them one-at-a-time to avoid impacting available capacity
You wish to replace them immediately
A suitable script would do the following:
Loop through all instances in a given Auto-Scaling Group using describe-auto-scaling-instances
If the instance belongs to the desired Auto Scaling group, retrieve its launch time via describe-instances
If the instance is older than the desired number of hours, terminate it using terminate-instance-in-auto-scaling-group with --no-should-decrement-desired-capacity so that it is automatically replaced
Then, wait a few minutes to allow it to be replaced and continue the loop
The script could be created by using the AWS Command-Line Interface (CLI) or a programming language such as Python.
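A minimal Python sketch of such a script follows. The assumptions are that `autoscaling` and `ec2` behave like boto3's Auto Scaling and EC2 clients, and that the group is small enough for pagination to be ignored. It reads the instance list with describe_auto_scaling_groups rather than describe-auto-scaling-instances, since that call filters by group name directly. The function names are invented for the example.

```python
import time
from datetime import datetime, timedelta, timezone


def expired_instance_ids(launch_times, max_age, now):
    """Return IDs older than `max_age`, oldest first."""
    ordered = sorted(launch_times.items(), key=lambda item: item[1])
    return [iid for iid, launched in ordered if now - launched > max_age]


def recycle_old_instances(autoscaling, ec2, group_name, max_age,
                          wait_seconds=300, sleep=time.sleep, now=None):
    """Terminate over-age instances one at a time so they get replaced."""
    now = now or datetime.now(timezone.utc)
    resp = autoscaling.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )
    ids = [i["InstanceId"] for i in resp["AutoScalingGroups"][0]["Instances"]]
    if not ids:
        return []

    # Look up each instance's launch time via EC2.
    launch_times = {}
    for reservation in ec2.describe_instances(InstanceIds=ids)["Reservations"]:
        for inst in reservation["Instances"]:
            launch_times[inst["InstanceId"]] = inst["LaunchTime"]

    recycled = []
    for iid in expired_instance_ids(launch_times, max_age, now):
        # Keep desired capacity unchanged so a replacement is launched.
        autoscaling.terminate_instance_in_auto_scaling_group(
            InstanceId=iid, ShouldDecrementDesiredCapacity=False
        )
        recycled.append(iid)
        sleep(wait_seconds)  # give the replacement time to come up
    return recycled
```

Running it from cron with something like `max_age=timedelta(days=7)` would approximate the weekly refresh asked about. A fixed wait is a simplification; polling until the replacement is InService would be more robust.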
Alternatively, you could program the instances to self-destruct after a given period of time (eg 72 hours) by simply calling the operating system to shut-down the instance. This would cause auto-scaling to terminate the instance and replace it.
There are two ways to achieve what you are looking for: Scheduled Auto Scaling Actions, or taking one of the instances out of the ASG.
Scheduled Scaling
Scaling based on a schedule allows you to scale your application in response to predictable load changes. For example, every week the traffic to your web application starts to increase on Wednesday, remains high on Thursday, and starts to decrease on Friday. You can plan your scaling activities based on the predictable traffic patterns of your web application.
https://docs.aws.amazon.com/autoscaling/latest/userguide/schedule_time.html
You most likely want this.
Auto Scaling enables you to put an instance that is in the InService state into the Standby state, update or troubleshoot the instance, and then return the instance to service. Instances that are on standby are still part of the Auto Scaling group, but they do not actively handle application traffic.
https://docs.aws.amazon.com/autoscaling/latest/userguide/as-enter-exit-standby.html
As of Nov 20, 2019, EC2 AutoScaling supports Max Instance Lifetime: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-ec2-auto-scaling-supports-max-instance-lifetime/
From:
The maximum instance lifetime specifies the maximum amount of time (in seconds) that an instance can be in service. The maximum duration applies to all current and future instances in the group. As an instance approaches its maximum duration, it is terminated and replaced, and cannot be used again.
When configuring the maximum instance lifetime for your Auto Scaling group, you must specify a value of at least 86,400 seconds (1 day). To clear a previously set value, specify a new value of 0.

Automatic recovery from an availability zone outage?

Are there any tools or techniques available to automatically create new instances in a different availability zone in the event that an availability zone suffers an outage in Amazon Web Services/EC2?
I think I understand how to do automatic fail over in the event of an availability zone (AZ) outage, but what about automatic recovery (create new instances in a new AZ) from an outage? Is that possible?
Example scenario:
We have a three-instance cluster.
An ELB round-robins traffic to the cluster.
We can lose any one instance, but not two instances in the cluster, and still be fully functional.
Because of (3), each instance is in a different AZ. Call them AZs A, B and C.
The ELB health check is configured so that the ELB can ensure each instance is healthy.
Assume that one instance is lost due to an AZ outage in AZ A.
At this point the ELB will see that the lost instance is no longer responding to health checks and will stop routing traffic to that instance. All requests will go to the two remaining healthy instances. Failover is successful.
Recovery is where I am not clear. Is there a way to automatically (i.e. no human intervention) replace the lost instance in a new AZ (e.g. AZ D)? This will avoid the AZ that had the outage (A) and not use an AZ that already has an instance in it (AZs B and C).
AutoScaling Groups?
AutoScaling Groups seem like a promising place to start, but I don't know if they can deal with this use case properly.
Questions:
In an AutoScaling Group there doesn't seem to be a way to specify that the new instances that replace dead/unhealthy instances should be created in a new AZ (e.g. create it in AZ D, not in AZ A). Is this really true?
In an AutoScaling Group there doesn't seem to be a way to tell the ELB to remove the failed AZ and automatically add a new AZ. Is that right?
Are these true shortcomings in AutoScaling Groups, or am I missing something?
If this can't be done with AutoScaling Groups, is there some other tool that will do this for me automatically?
In 2011 FourSquare, Reddit and others were caught by being reliant on a single availability zone (http://www.informationweek.com/cloud-computing/infrastructure/amazon-outage-multiple-zones-a-smart-str/240009598). It seems like since then tools would have come a long way. I have been surprised by the lack of automated recovery solutions. Is each company just rolling its own solution and/or doing the recovery manually? Or maybe they're just rolling the dice and hoping it doesn't happen again?
Update:
@Steffen Opel, thanks for the detailed explanation. Auto scaling groups are looking better, but I think there is still an issue with them when used with an ELB.
Suppose I create a single auto scaling group with a min, max & desired set to 3, spread across 4 AZs. Auto scaling would create 1 instance in 3 different AZs, with the 4th AZ left empty. How do I configure the ELB? If it forwards to all 4 AZs, that won't work because one AZ will always have zero instances and the ELB will still route traffic to it. This will result in HTTP 503s being returned when traffic goes to the empty AZ. I have experienced this myself in the past. Here is an example of what I saw before.
This seems to require manually updating the ELB's AZs to just those with instances running in them. This would need to happen every time auto scaling results in a different mix of AZs. Is that right, or am I missing something?
Is there a way to automatically (i.e. no human intervention) replace the lost instance in a new AZ (e.g. AZ D)?
Auto Scaling is indeed the appropriate service for your use case - to answer your respective questions:
In an AutoScaling Group there doesn't seem to be a way to specify that the new instances that replace dead/unhealthy instances should be created in a new AZ (e.g. create it in AZ D, not in AZ A). Is this really true? In an AutoScaling Group there doesn't seem to be a way to tell the ELB to remove the failed AZ and automatically add a new AZ. Is that right?
You don't have to specify/tell anything of that explicitly, it's implied in how Auto Scaling works (See Auto Scaling Concepts and Terminology) - You simply configure an Auto Scaling group with a) the number of instances you want to run (by defining the minimum, maximum, and desired number of running EC2 instances the group must have) and b) which AZs are appropriate targets for your instances (usually/ideally all AZs available in your account within a region).
Auto Scaling then takes care of a) starting the requested number of instances and b) balancing these instances in the configured AZs. An AZ outage is handled automatically, see Availability Zones and Regions:
Auto Scaling lets you take advantage of the safety and reliability of geographic redundancy by spanning Auto Scaling groups across multiple Availability Zones within a region. When one Availability Zone becomes unhealthy or unavailable, Auto Scaling launches new instances in an unaffected Availability Zone. When the unhealthy Availability Zone returns to a healthy state, Auto Scaling automatically redistributes the application instances evenly across all of the designated Availability Zones. [emphasis mine]
The subsequent section Instance Distribution and Balance Across Multiple Zones explains the algorithm further:
Auto Scaling attempts to distribute instances evenly between the Availability Zones that are enabled for your Auto Scaling group. Auto Scaling does this by attempting to launch new instances in the Availability Zone with the fewest instances. If the attempt fails, however, Auto Scaling will attempt to launch in other zones until it succeeds. [emphasis mine]
Please check the linked documentation for even more details and how edge cases are handled.
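The quoted launch behaviour can be simulated in a few lines of Python. This is an illustration of the documented rule only, not of AWS internals: the zone names, the `unavailable` set and the function name are assumptions made for the example, and ties are broken alphabetically to keep the result repeatable.

```python
def choose_launch_az(instance_counts, unavailable=frozenset()):
    """Pick the enabled AZ with the fewest instances that can take a launch.

    `instance_counts` maps AZ name to the current instance count;
    `unavailable` holds AZs where a launch attempt would fail.
    """
    # Try zones from emptiest to fullest.
    for az in sorted(instance_counts, key=lambda z: (instance_counts[z], z)):
        if az not in unavailable:
            return az
    raise RuntimeError("no Availability Zone could accept the launch")


# Scenario from the question: the instance in AZ A was lost in an outage.
counts = {"A": 0, "B": 1, "C": 1, "D": 0}
print(choose_launch_az(counts, unavailable={"A"}))  # D
```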
Update
Regarding your follow up question about the number of AZs being higher than the number of instances,
I think you need to resort to a pragmatic approach:
You should simply select a number of AZs equal to or lower than the number of instances you want to run; in case of an AZ outage, Auto Scaling will happily balance your instances across the remaining healthy AZs, which means you'd be able to survive the outage of 2 out of 3 AZs in your example and still have all 3 instances running in the remaining AZ.
Please note that while it might be intriguing to use as many AZs as are available, New customers can access three EC2 Availability Zones in US East (Northern Virginia) and two in US West (Northern California) only anyway (see Global Infrastructure), i.e. only older accounts might actually have access to all 5 AZs in us-east-1, some just 4 and newer ones 3 at most.
I consider this to be a legacy issue, i.e. AWS is apparently rotating older AZs out of operation. For example, even if you have access to all 5 AZs in us-east-1, some instance types might not in fact be available in all of them (e.g. the New EC2 Second Generation Standard Instances m3.xlarge and m3.2xlarge are only available in 3 out of 5 AZs in one of the accounts I'm using).
Put another way, 2-3 AZs are considered to be a fairly good compromise for fault tolerance within a region, if anything cross region fault tolerance would probably be the next thing I'd be worried about.
there are many ways to solve this problem. without knowing the particulars of what your "cluster" is and how a new node comes alive, maybe registers with a master, loads data, etc, to bootstrap. for instance on hadoop, a new slave node needs to be registered with the namenode that will be serving it content. but ignoring that. just focusing on a startup of a new node.
you can use the cli tools for windows or linux instances. i fire them off from both my dev box in both os's and on the servers both os's. here is the link for linux for example:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/setting_up_ec2_command_linux.html#set_aes_home_linux
They consist of scores of commands that you can execute at the dos or linux shell to do things like fire off an instance or terminate one. they require the configuring of environment variables like your aws credentials and the path to java. here is an example input and output for creating an instance in AvailZone=us-east-1d
sample command:
ec2-request-spot-instances ami-52009e3b -p 0.02 -z us-east-1d --key DrewKP3 --group linux --instance-type m1.medium -n 1 --type one-time
sample output:
SPOTINSTANCEREQUEST sir-0fd0dc32 0.020000 one-time Linux/UNIX open 2013-05-01T09:22:18-0400 ami-52009e3b m1.medium DrewKP3 linux us-east-1d monitoring-disabled
note I am being a cheap-wad and using a 2 cent Spot Instance whereby you would be using a standard instance and not spot. but then again I am creating hundreds of servers.
alright, so you have a database. for argument sake, let's say you have AWS RDS mysql, micro instance running in Multi-AvailZone mode for an extra half a cent an hr. that is 72 cents a day. It contains a table, call it zonepref (AZ,preference). such as
us-west-1b,1
us-west-1c,2
us-west-2b,3
us-east-1d,4
eu-west-1b,5
ap-southeast-1a,6
you get the idea. The preference of zones.
there is another table in RDS that is something like "active_nodes" with columns IP addr, instance-id,zone,lastcontact,status (string,string,string,datetime,char). let's say it contains the following active nodes info:
'10.70.132.101','i-2c55bb41','us-east-1d','2013-05-01 11:18:09','A'
'10.70.132.102','i-2c66bb42','us-west-1b','2013-05-01 11:14:34','A'
'10.70.132.103','i-2c77bb43','us-west-2b','2013-05-01 11:17:17','A'
'A'=Alive and healthy, 'G'=going dead, 'D'=Dead
now your node on startup establishes either a cron job or runs a service, let's call it a server that is in any language of your liking like java or ruby. this is baked into your ami to run at startup, and on initialization it goes out and does an insert of its data into the active_nodes table so its row is there. at a minimum it runs every, say, 5 min (depending on how mission critical this whole thing is). the cron job would run at that interval or the java/ruby would create a thread that would sleep for that amount of time. when it comes to life, it grabs its ipaddr,instanceid,AZ, and makes a call to RDS to update its row where status='A' using UTC time for lastcontact which is consistent across timezones. If its status is not 'A' then no update will occur.
In addition it updates the status column of any other ip addr row in there that is status='A', changing it to status='G' (going dead) for any, like I said, other ipaddr that now()-lastcontact is greater than, say, 6 or 7 minutes. Additionally it can using sockets (pick a port) contact that Going Dead server and say, hey, are you there ? If so, maybe that Going Dead server merely can't access RDS tho it is in Multi-AZ but can still handle other traffic. If no contact then change the other server status to 'D'=Dead. Refine as needed.
The concept of writing the 'server' that runs on its node here is one that has a housekeeping thread that sleeps, and the main thread that will block/listen on a port. the whole thing can be written in ruby in less than 50 to 70 lines of code.
The servers can use the CLI and terminate the instance id's of other servers, but before doing so it would do something like issue a select statement from table zonepref ordered by preference for the first row that is not in active_nodes. it now has the next zone, it runs ec2-run-instances with the correct ami-id and next zone etc, passing along user data if necessary. You don't want both the Alive servers to create a new instance, so either wrap the create with a row lock in mysql or push the request onto a queue or a stack so only one of them perform it.
anyway, might seem like overkill, but i do a lot of cluster work where nodes have to talk to one another directly. Note that I am not suggesting that just because a node seems to have lost its heartbeat that its AZ has gone down :> Maybe just that instance lost its lunch.
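The housekeeping logic described in this answer can be sketched in Python with the two tables held in memory instead of RDS. This is one possible reading of the design, not the answerer's actual code: the function names are invented, the 7 minute timeout is taken from the "6 or 7 minutes" above, and "not in active_nodes" is interpreted literally as any zone that has a row in the table, whatever its status.

```python
from datetime import datetime, timedelta

# zonepref (AZ, preference), as listed above
zonepref = [
    ("us-west-1b", 1), ("us-west-1c", 2), ("us-west-2b", 3),
    ("us-east-1d", 4), ("eu-west-1b", 5), ("ap-southeast-1a", 6),
]

# active_nodes rows, as listed above
active_nodes = [
    {"ip": "10.70.132.101", "id": "i-2c55bb41", "zone": "us-east-1d",
     "lastcontact": datetime(2013, 5, 1, 11, 18, 9), "status": "A"},
    {"ip": "10.70.132.102", "id": "i-2c66bb42", "zone": "us-west-1b",
     "lastcontact": datetime(2013, 5, 1, 11, 14, 34), "status": "A"},
    {"ip": "10.70.132.103", "id": "i-2c77bb43", "zone": "us-west-2b",
     "lastcontact": datetime(2013, 5, 1, 11, 17, 17), "status": "A"},
]


def mark_stale(nodes, now, timeout=timedelta(minutes=7)):
    """Flip Alive nodes with an old heartbeat to 'G' (going dead)."""
    for node in nodes:
        if node["status"] == "A" and now - node["lastcontact"] > timeout:
            node["status"] = "G"


def next_zone(prefs, nodes):
    """Return the most preferred zone that has no row in active_nodes."""
    used = {node["zone"] for node in nodes}
    for zone, _ in sorted(prefs, key=lambda row: row[1]):
        if zone not in used:
            return zone
    return None


mark_stale(active_nodes, now=datetime(2013, 5, 1, 11, 22, 0))
print([n["status"] for n in active_nodes])  # ['A', 'G', 'A']
print(next_zone(zonepref, active_nodes))    # us-west-1c
```

The socket check, the 'D' state and the row lock that stops two live servers from both launching a replacement are left out of the sketch.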
Not enough rep to comment.
I wanted to add that an ELB will not route traffic to an empty AZ. This is because ELBs route traffic to instances, not AZs.
Attaching AZs to an ELB merely creates an Elastic Network Interface in a subnet in that AZ so that traffic could be routed if an instance in that AZ is added. It's adding instances (for which the AZ associated with the instance must also be associated with the ELB) that creates the routing.