EC2 Auto Scaling Group's Instance refresh goes below Healthy threshold - amazon-web-services

I have an ASG with desired/min/max of 1/1/5 instances (I want ASG just for rolling deploys and zone failover). When I start the Instance refresh with MinHealthyPercentage=100,InstanceWarmup=180, the process starts by deregistration (the instance goes to draining mode almost immediately on my ALB, instead waiting the 180 Warmup seconds until the new instance is healthy) and the application becomes unavailable for a while.
Note that this is not specific just to my case with one instance. If I had two instances, the process also starts by deregistering one of the instances and that does not fulfill the 100% MinHealthy constraint either (the app will stay available, though)!
Is there any other configuration option I should tune to get the rolling update create and warm up the new instance first?

Currently instance refresh always terminates before launching, and it uses the minHealthyPercent to determine batch size and when it can move on to the next batch.
It takes a set of instances out of service, terminates them, and launches a set of instances with the new desired configuration. Then, it waits until the instances pass your health checks and complete warmup before it moves on to replacing other instances.
...
Setting the minimum healthy percentage to 100 percent limits the rate of replacement to one instance at a time. In contrast, setting it to 0 percent causes all instances to be replaced at the same time.
https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-instance-refresh.html

If you are running the 1 instance and using the Launch template with the Autoscaling it would be hard to rolling update the EC2 instance.
i am coming from the above scenario and hitting up on this immature feature of AWS.
it's mentioned in the limitation of instance refresh, it will scale down the instance and will recreate the new one instead of creating the first new one instance.
Instances terminated before launch: When there is only one instance in
the Auto Scaling group, starting an instance refresh can result in an
outage. This is because Amazon EC2 Auto Scaling terminates an instance
and then launches a new instance.
Ref : https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-instance-refresh.html
i tried work around of scaling up the auto-scaling group desired size to 2, it will create a new instance with the latest AMI in the launch template.
Now you have two instances running the old version & latest version, you will be good to set the desired capacity now back to 1 in the auto-scaling group.
Auto-scaling desired capacity to 1 will delete the older instance and keep the latest instance with the latest AMI.
Command to update desired capacity to 2
- aws autoscaling update-auto-scaling-group --auto-scaling-group-name $ASG_GROUP --desired-capacity 2
Command to update desired capacity to 1
- aws autoscaling update-auto-scaling-group --auto-scaling-group-name $ASG_GROUP --desired-capacity 1
Instead of using the instance-refresh this worked well for me.

This does not seem to be the case anymore. An instance refresh creates now a fresh instance and terminates the old one after health checks are successful. AWS Support mentioned this behavior was not changed since 2020.

Related

How can I control which EC2 instances get removed by an AutoScalingGroup using Amazon Web Services?

I have foreseen a problem that could happen with my application but I am unsure if it is possible to solve, and perhaps the architecture needs to be redesigned.
I am using an AutoScalingGroup (ASG) on AWS to create EC2 instances that host game servers that players can join. At the moment, the ASG is scaled manually via a matchmaking API which changes the desired capacity based on its needs. The problem occurs when a game server is finished.
When a game finishes, it signals to the matchmaker that it is finished and needs terminating, and the matchmaker will then scale down the ASG accordingly, however, it doesn't seem to know exactly which instance to remove, and it won't necessarily be the one that needs terminating.
I can terminate the instance, but then as the ASG desired capacity is never changed when the instance is terminated, another server is created.
Is there a way I can scale down the ASG, as well as specifying which servers to remove from the group?
In a nutshell, the default termination policy during scale in is designed to remove instances that use the oldest launch configuration.
Currently, Amazon EC2 Auto Scaling supports the following termination policie:
OldestInstance Terminate the oldest instance in the group. This option is useful when you're upgrading the instances in the Auto Scaling group to a new EC2 instance type. You can gradually replace instances of the old type with instances of the new type.
NewestInstance Terminate the newest instance in the group. This policy is useful when you're testing a new launch configuration but don't want to keep it in production.
OldestLaunchConfiguration Terminate instances that have the oldest launch configuration. This policy is useful when you're updating a group and phasing out the instances from a previous configuration.
ClosestToNextInstanceHour Terminate instances that are closest to the next billing hour. This policy helps you maximize the use of your instances and manage your Amazon EC2 usage costs.
Default Terminate instances according to the default termination policy. This policy is useful when you have more than one scaling policy for the group.
Instance protection
One of the possible solutions could be to use Instance protection. The auto-scaling provides an instance protection to control whether instance can be terminated when scaling-in.
Therefore, enable the instance protection for ASG to protect instances from scaling-in by default. Once you are done with you server, decrease a value of desired number of instances, remove instance protection from particular instance (either using CLI or SDK; note that this protection remains enabled for the rest of instances) and auto-scaling will terminate that exact instance.
For more information about instance protection, see Instance Protection
The oldest server is removed. If you want to scale down a specific server, you will have to kill that server before changing desired capacity.

AWS Autoscaling updating

You can create new Launch Configuration (updating AMI or whatever) and attach this with an existing Autoscaling Group. Per AWS Docs: After you change the launch configuration for an Auto Scaling group, any new instances are launched using the new configuration options, but existing instances are not affected.
How do you force this? Meaning relaunch all new instances now (with the new AMI). Do I have to delete the existing Autoscaling Group and create a new Autoscaling Group (with new Config)? Or I simple delete existing instances (one by one manually) and then ASG relaunch with new AMI. Any best practices/gotchas?
CloudFormation has the RollingUpdate flag (not sure of this outside of CF)
Thanks
AWS has some OOTB solutions for this, CloudFormation (like you say), Elastic Beanstalk (built on top of CF), and CodeDeploy blue-green deployments (I've not tried this).
Personally for our SQS polling ASG, we do a "cold deploy" i.e. only "deploy" when there are no messages to process (and hence, due a scaling policy, no instances). It's been really effective.
A deploy can be done safely whilst there are messages, provided that you set scale-in-protection on the instance during message processing (and remove it and wait briefly before polling):
set desired-capacity to 0
wait a bit (for there to be no instances running)
set desired-capacity back to N.
Note: you can do this all in the console.
You can code a solution yourself that does this... but I probably wouldn't.
Be careful:
simple delete existing instances (one by one manually)
Whether you can do this, or depends on whether the instances are still handling requests/processing (usually you can't simply terminate an instance without dropping service).
I recommend Elastic Beanstalk which gives a rolling update feature for free and is very easy to get started. I've not tried the CodeDeploy blue-green but it looks interesting. If you want more advanced behavior (or are already using it) look into Cloud Formation... do not code your own solution for rolling deployments: just use CloudFormation.
if your issue is with "in flight" requests simply enable connection draining or increase de-registration delay of the ELB or "target groups" attached with the ASG. You can set a value up to one hour.
When you enable connection draining, you can specify a maximum time for the load balancer to keep connections alive before reporting the instance as de-registered. The maximum timeout value can be set between 1 and 3,600 seconds (the default is 300 seconds). When the maximum time limit is reached, the load balancer forcibly closes connections to the de-registering instance.
Then you can detached old instances.
If you detach an instance from an Auto Scaling group that has an attached load balancer, the instance is deregistered from the load balancer. If you detach an instance from an Auto Scaling group that has an attached target group, the instance is deregistered from the target group. If connection draining is enabled for your load balancer, Auto Scaling waits for in-flight requests to complete.
If you don't want to do any manual scaling I guess the best approach is to changing the termination policy to OldestInstance and leave the ASG as it is. When the scale-in activity happens ASG will automatically terminate the old instances.(in your case old launch config instances)
OldestInstance. Auto Scaling terminates the oldest instance in the group. This option is useful when you're upgrading the instances in the Auto Scaling group to a new EC2 instance type. You can gradually replace instances of the old type with instances of the new type.

AWS Is it possible to automatically terminate and recreate new instances for an auto scaling group periodically?

We have an AWS scaling group that has 10-20 servers behind a load balancer. After running for a couple of weeks some these server go bad. We have no idea why the servers go bad and it will take some time for us to get to a stage where we can debug this issue.
In the interim is there a way to tell AWS to terminate all the instances in the scaling group in a controlled fashion (one by one) until all the instances are replaced by new ones every week or so?
You can achieve this very effectively using Data Pipeline.
This is the developer guide for How do I stop and start Amazon EC2 Instances at scheduled intervals with AWS Data Pipeline?
There is no function in Auto Scaling to tell it to automatically terminate and replace instances. However, you could script such functionality.
Assumptions:
Terminate instances that are older than a certain number of hours old
Do them one-at-a-time to avoid impacting available capacity
You wish to replace them immediately
A suitable script would do the following:
Loop through all instances in a given Auto-Scaling Group using describe-auto-scaling-instances
If the instance belongs to the desired Auto Scaling group, retrieve its launch time via describe-instances
If the instance is older than the desired number of hours, terminate it using terminate-instance-in-auto-scaling-group with --no-should-decrement-desired-capacity so that it is automatically replaced
Then, wait a few minutes to allow it to be replaced and continue the loop
The script could be created by using the AWS Command-Line Interface (CLI) or a programming language such as Python.
Alternatively, you could program the instances to self-destruct after a given period of time (eg 72 hours) by simply calling the operating system to shut-down the instance. This would cause auto-scaling to terminate the instance and replace it.
There are two ways to achieve what you are looking for, Scheduled Auto Scaling Actions or take them one of the instances out of the ASG.
Scheduled Scaling
Scaling based on a schedule allows you to scale your application in response to predictable load changes. For example, every week the traffic to your web application starts to increase on Wednesday, remains high on Thursday, and starts to decrease on Friday. You can plan your scaling activities based on the predictable traffic patterns of your web application.
https://docs.aws.amazon.com/autoscaling/latest/userguide/schedule_time.html
You most likely want this.
Auto Scaling enables you to put an instance that is in the InService state into the Standby state, update or troubleshoot the instance, and then return the instance to service. Instances that are on standby are still part of the Auto Scaling group, but they do not actively handle application traffic.
https://docs.aws.amazon.com/autoscaling/latest/userguide/as-enter-exit-standby.html
As of Nov 20, 2019, EC2 AutoScaling supports Max Instance Lifetime: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-ec2-auto-scaling-supports-max-instance-lifetime/
From:
The maximum instance lifetime specifies the maximum amount of time (in
seconds) that an instance can be in service. The maximum duration
applies to all current and future instances in the group. As an
instance approaches its maximum duration, it is terminated and
replaced, and cannot be used again.
When configuring the maximum instance lifetime for your Auto Scaling
group, you must specify a value of at least 86,400 seconds (1 day). To
clear a previously set value, specify a new value of 0.

AWS - EC2 stop instances not working properly

I'm very new on AWS and now I have two EC2 instances. In order to avoid waste the free tier plan I'm trying to stop instances when I'm not working with them.
This is what my EC2 Management console shows. As you can see there are two instances running and two instances terminated. I did not terminate swipe-dev, I have just stop it. But for any reason now is terminated plus new same instance with same source code was started. Why?
What I'm doing wrong? I just want stop instances.
Edit
I have decide keep just one project so I terminate eb-flask-demo-dev and stop swipe-dev instance. After few minutes instace state was stoped and I thought finally everything is fine. But I rejoin to EC2 console and this is what it shows.
Why swipe-dev is running again? and Why there is another terminated instance?
This is possible if your instance is a member of an Autoscaling group with desired capacity = 1 and set minimum size to 0. To maintain the number of healthy instances, Auto Scaling performs a periodic health check on running instances within an Auto Scaling group. When it finds that an instance is unhealthy (in this case because you stopped it), it terminates that instance and launches a new one.

AWS AutoScaling 'oldestinstance' Termination Policy does not always terminate oldest instances

Scenario
I am creating a script that will launch new instances into an AutoScaling Group and then remove the old instances. The purpose is to introduce newly created (or updated) AMI's to the AutoScaling Group. This is accomplished by increasing the Desired capacity by double the current number of instances. Then, after the new instances are Running, decreasing the Desired capacity by the same number.
Problem
When I run the script, I watch the group capacity increase by double, the new instances come online, they reach the Running state, and then the group capacity is decreased. Works like a charm. The problem is that SOMETIMES the instances that are terminated by the decrease are actually the new ones instead of the older ones.
Question
How can I ensure that the AutoScaling Group will always terminate the Oldest Instance?
Settings
The AutoScaling Group has the following Termination Polices: OldestInstance, OldestLaunchConfiguration. The Default policy has been removed.
The Default Cooldown is set to 0 seconds.
The Group only has one Availability Zone.
Troubleshooting
I played around with the Cooldown setting. Ended up just putting it on 0.
I waited different lengths of time to see if the existing servers needed to be running for a certain amount of time before they would be terminated. It seems that if they are less than 5 minutes old, they are less likely to be terminated, but not always. I had servers that were 20 minutes old that were not terminated instead of the new ones. Perhaps newly launched instances have some termination protection grace period?
Concession
I know that in most cases, the servers I will be replacing will have been running for a long time. In production, this might not be an issue. Still, it is possible that during the normal course of AutoScaling, an older server will be left running instead of a newer one. This is not an acceptable way to operate.
I could force specific instances to terminate, but that would defeat the point of the OldestInstance Termination Policy.
Update: 12 Feb 2014
I have continued to see this in production. Instances with older launch configs that have been running for weeks will be left running while newer instances will be terminated. At this point I am considering this to be a bug. A thread at Amazon was opened for this topic a couple years ago, apparently without resolution.
Update: 21 Feb 2014
I have been working with AWS support staff and at this point they have preliminarily confirmed it could be a bug. They are researching the problem.
It doesn't look like you can, precisely, because auto-scaling is trying to do one other thing for you in addition to having the correct number of instances running: keep your instance counts balanced across availability zones... and it prioritizes this consideration higher than your termination policy.
Before Auto Scaling selects an instance to terminate, it first identifies the Availability Zone that has more instances than the other Availability Zones used by the group. If all Availability Zones have the same number of instances, it identifies a random Availability Zone. Within the identified Availability Zone, Auto Scaling uses the termination policy to select the instance for termination.
— http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/us-termination-policy.html
If you're out of balance, then staying in balance is arguably the most sensible strategy, especially if you are using ELB. The documentation is a little ambiguous, but ELB will advertise one public IP in the DNS for each availability zone where it is configured; these three IP addresses will achieve the first tier of load balancing by virtue of round-robin DNS. If all of the availability zones where the ELB is enabled have healthy instances, then there appears to be a 1:1 correlation between which external IP the traffic hits and which availability zone's servers that traffic will be offered to by ELB -- at least that is what my server logs show. It appears that ELB doesn't route traffic across availability zones to alternate servers unless all of the servers in a given zone are detected as unhealthy, and that may be one of the justifications of why they've implemented autoscaling this way.
Although this algorithm might not always kill the oldest instance first on a region-wide basis, if it does operate as documented, it would kill off the oldest one in the selected availability zone, and at some point it should end up cycling through all of them over the course of several shifts in load... so it would not leave the oldest running indefinitely, either. The larger the number of instances in the group is, it seems like the less significant this effect should be.
There are a couple of other ways to do it:
Increase desired to 2x
Wait for action to increase capacity
When the new instances are running, suspend all AS activity (as-suspend-processes MyAutoScalingGroup)
Reset desired
Terminate old instances
Resume AS activity.
Or:
Bring up a brand new ASG with the new launch config.
Suspend AS activity , until 1. is finished.
If everything is ok, delete the old ASG.
Resume AS activity
For ultimate rollback deployment:
Create new ELB (might have to ask Amazon to provision more elb if you have a lot of traffic, this is kinda lame and makes it not automation friendly)
Create new ASG with new LC
Switch DNS to new ELB
Delete old ELB/ASG/LC if everything's fine, if not just change DNS back
Or with the new ASG API that lets you attach/detach instances from ASG:
Somehow bring up your new instances (could just be run-instances or create a temp asg)
Suspend AS activity, Use http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/attach-instance-asg.html to attach them to your old ASG,
Use http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/detach-instance-asg.html or terminate your old instances
Resume AS activity
The reason you might want to use your old ASG is because it can be a pita to set all the policies again (even when automated) and it feels a bit safer to change as little as possible.
A.
My use case is that we needed to scale down and be able to choose which machines go down. Unfortunately the termination policy "OldestFirst" was not working for us either. I was able to use a variant of the attach/detach method that ambakshi shared to remove the oldest (or any instance I choose) and at the same time lower the desired instances value of the autoscaling group.
Step 1 – Change the autoscaling group Min value to the number you want to scale down to.
Step 2 – Suspend the ASG
Step 3 – Detach the instances you want to terminate, you can do multiple instances in one command. Make sure to use the should-decrement-desired-capacity flag
Step 4 – Resume the ASG
Step 5 – Terminate your instances using the console or the CLI
UPDATE
There is no need to suspend the Auto Scaling Group, just doing steps 1, 3 and 5 worked for me. Just be aware of any availability zone balancing that may happen.