Delay ASG Scaling up trigger while machine boots - amazon-web-services

We have set up our Auto Scaling Group to scale up the number of instances when a certain load threshold is met.
The problem is that once a new instance has finished booting, a bootstrap script runs to configure the machine (the bootstrap script launches Puppet, which configures the machine accordingly).
While this script runs (which can take a few minutes), the load on the machine is high, which causes the ASG to launch yet another machine, which in turn causes yet another instance to be created, and so on.
Is there a way to tell the ASG not to start collecting metrics from this machine until a given amount of time has gone by (or, better yet, until the bootstrap script is done)?

You probably need to set the HealthCheckGracePeriod higher:
Length of time in seconds after a new Amazon EC2 instance comes into service that Auto Scaling starts checking its health. During this time any health check failure for that instance is ignored.
This is required if you are adding an ELB health check. Frequently, new instances need to warm up, briefly, before they can pass a health check. To provide ample warm-up time, set the health check grace period of the group to match the expected startup period of your application.
http://docs.aws.amazon.com/AutoScaling/latest/APIReference/API_CreateAutoScalingGroup.html

In the meantime, there is such a setting for scaling policies: Instance warm-up
It tells the autoscaling system not to take newly launched instances' metrics into consideration for a specified amount of time (the warm up time).
Since it defaults to the ASG's "cooldown" time, the two settings are related. You might need to adjust the cooldown as well so that you don't trigger too many scaling activities just because your instances aren't ready yet. The documentation says the cooldown is not respected by target tracking or step scaling policies, but I could not confirm that in my tests and had to adjust it as well.
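To make the effect of the warm-up concrete, here is a minimal Python sketch of the idea, not of how AWS actually implements it: instances younger than the warm-up period are left out of the average that drives the scaling decision. The function name, the dictionary keys and all the numbers are assumptions made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def average_cpu_excluding_warmup(instances, now, warmup_seconds):
    """Average CPU over instances that have finished warming up.

    `instances` is a list of dicts with the hypothetical keys
    'launch_time' (timezone-aware datetime) and 'cpu' (percent).
    Returns None when every instance is still warming up.
    """
    warmed = [
        i["cpu"]
        for i in instances
        if (now - i["launch_time"]).total_seconds() >= warmup_seconds
    ]
    if not warmed:
        return None
    return sum(warmed) / len(warmed)


now = datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc)
fleet = [
    {"launch_time": now - timedelta(hours=2), "cpu": 40.0},     # established
    {"launch_time": now - timedelta(hours=1), "cpu": 50.0},     # established
    {"launch_time": now - timedelta(seconds=60), "cpu": 95.0},  # still bootstrapping
]

# Average over everything versus average over warmed-up instances only.
naive = sum(i["cpu"] for i in fleet) / len(fleet)
adjusted = average_cpu_excluding_warmup(fleet, now, warmup_seconds=300)
print(naive, adjusted)
```

With a threshold of, say, 60%, the naive average would trigger another scale-out because of the bootstrapping instance, while the adjusted one would not.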

Related

AWS initial instance drains when auto-scaling

I'm trying to configure an AWS autoscaling setup (This is the first time I'm trying it). So far I have created an alarm to add a new instance when
CPU usage is more than 25%
Period 1
Data points 1 out of 1
Then I ran a JMeter script to generate load on the machine. When the load is high, the alarm goes into the ALARM state (CPU usage is approximately 60% and memory 50%).
My problem is,
Before the second instance is up and healthy, both my instances start draining. I expected something like this if the CPU usage is 100% or memory is 100%. When I stop the Jmeter script, both the instances will become healthy in a few minutes.
Then if I apply the same load again (when both instances are up), the system runs smoothly and even adds a third instance.
My question is: what could be the reason for the working instance draining when the CPU usage is NOT 100%?
Any ideas?
The AutoScaling Group will show you the reason it decides to terminate instances in the Activity History (if you're still using the old console you have to press the dropdown arrow on the left side of the 'terminate' message).
I assume that when you say it's draining, you mean that Auto Scaling is deregistering it from the load balancer and getting ready to terminate it. What I assume is happening is that the instance is failing ELB health checks, which the ASG sees; it then marks the instance as unhealthy and terminates it. If you have an Application Load Balancer, it will usually show the reason for health check failures if you hover over the (i) next to the instance when you look at the target group's Instances tab.
In general, the only reasons an ASG would terminate an instance are:
A manual change or scale-in alarm caused the desired capacity to go down
A health check failure (EC2, or ELB if ELB health checks are enabled on the ASG)
Some sort of administrative task like AZ rebalancing
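The Activity History mentioned above can also be read programmatically. The following is a rough Python sketch, assuming boto3 is installed and credentials are configured; describe_scaling_activities is the boto3 call behind the Activity History, while the helper names and the sample records are my own illustrations (real Cause texts are longer).

```python
def termination_causes(activities):
    """Return (description, cause) pairs for terminate activities.

    `activities` has the shape of the 'Activities' list returned by
    the Auto Scaling DescribeScalingActivities API.
    """
    return [
        (a.get("Description", ""), a.get("Cause", ""))
        for a in activities
        if a.get("Description", "").startswith("Terminating")
    ]


def fetch_activities(group_name):
    """Fetch recent scaling activities (needs boto3 and AWS credentials)."""
    import boto3  # imported here so the helper above works without it

    client = boto3.client("autoscaling")
    response = client.describe_scaling_activities(AutoScalingGroupName=group_name)
    return response["Activities"]


# Hand-written sample in the API's shape; the texts are abbreviated illustrations.
sample = [
    {"Description": "Launching a new EC2 instance: i-0aaa",
     "Cause": "an alarm triggered a scaling policy"},
    {"Description": "Terminating EC2 instance: i-0bbb",
     "Cause": "an instance was taken out of service after an ELB health check failure"},
]
for description, cause in termination_causes(sample):
    print(description, "->", cause)
```

Against a real group you would call termination_causes(fetch_activities("your-group-name")) instead of using the sample list.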

How to execute a shell script as a result of an aws auto-scale event

Background
I got the following setup with AWS code deploy:
Currently we have our EC2 application servers connected to an auto-scaling group, but there is a missing step: once a new server is fired up, we don't automatically deploy the latest code on it from our git repo
Question
I was going over this tutorial:
Basically, I want to run a bunch of commands as soon as an instance is launched but before it's hooked up to the load balancer.
The above tutorial describes things in general, but I couldn't answer the following questions:
Where do I save the script on the ec2 instance?
How is that script executed once the instance is launched (scaled out) but before it's connected to the load balancer?
I think you do not need a lifecycle hook. Lifecycle hooks are useful when you want to perform an action at different states such as stop, start, and terminate, but you just want to pull the latest code and run some other commands.
To answer your question, I suggest the approach below, though there are many other ways to do the same task.
You do not need to save the script on the instance. Either put the commands directly in the user data of your launch configuration and run them as a bash script, or store your scripts on S3 and pull them from there.
This can be the simplest example to handle the pull-code case. It will run whenever a new instance launches in this Auto Scaling group.
Another example is to run a complex script: place it on S3 and pull it during scale-up.
I assume you have already set up permissions for S3 and Bitbucket. You can run any complex script at this point.
The second step is a bit tricky, and you can take a different approach: the instance will never receive traffic until it is healthy, so start your application only once the code is updated and all the required scripts have finished executing.
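As an illustration of the user data approach, here is a small Python sketch that assembles such a bash script, with the application started last. It is not the original answer's example; the repository URL, the directory and the start command are placeholders I made up, and it assumes git is installed on the image and the instance is allowed to read the repository.

```python
import base64


def build_user_data(repo_url, app_dir, start_command):
    """Return a bash user data script that deploys code, then starts the app."""
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        f"git clone --depth 1 {repo_url} {app_dir}",
        f"cd {app_dir}",
        "# Start the application last, so health checks only pass once the code is in place",
        start_command,
    ]
    return "\n".join(lines) + "\n"


script = build_user_data(
    repo_url="https://example.com/your-org/your-app.git",  # hypothetical repository
    app_dir="/opt/app",                                     # hypothetical path
    start_command="systemctl start your-app",               # hypothetical service
)
print(script)

# Launch templates expect the user data to be base64-encoded.
encoded = base64.b64encode(script.encode("utf-8")).decode("ascii")
```

Because the start command is the last line, the load balancer health check cannot succeed before the code has been pulled.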
Another approach can be
a) Health Check Grace Period
Frequently, an Auto Scaling instance that has just come into service
needs to warm up before it can pass the health check. Amazon EC2
Auto Scaling waits until the health check grace period ends before
checking the health status of the instance.
b) Custom Health Checks
If you have your own health check system, you can send the instance's
health information directly from your system to Amazon EC2 Auto
Scaling.
Use the following set-instance-health command to set the health state
of the specified instance to Unhealthy.
aws autoscaling set-instance-health --instance-id i-123abc45d --health-status Unhealthy
You can get the instance ID with a curl call to the instance metadata service from the script that we place in the user data.
If you have custom health checks, you can send the information from your health checks to Amazon EC2 Auto Scaling so that Amazon EC2 Auto Scaling can use this information. For example, if you determine that an instance is not functioning as expected, you can set the health status of the instance to Unhealthy. The next time that Amazon EC2 Auto Scaling performs a health check on the instance, it will determine that the instance is unhealthy and then launch a replacement instance.
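The same call can be made from Python. This is only a sketch, assuming boto3 is available and the instance role is allowed to call SetInstanceHealth; set_instance_health is the boto3 equivalent of the CLI command above, and the two helper names are mine.

```python
def health_report(instance_id, checks_passed):
    """Build the arguments for the Auto Scaling SetInstanceHealth call."""
    return {
        "InstanceId": instance_id,
        # The API accepts exactly 'Healthy' or 'Unhealthy'
        "HealthStatus": "Healthy" if checks_passed else "Unhealthy",
    }


def report_health(instance_id, checks_passed):
    """Send the report (needs boto3, credentials and IAM permission)."""
    import boto3  # imported lazily so health_report() works without it

    boto3.client("autoscaling").set_instance_health(
        **health_report(instance_id, checks_passed)
    )


# Dry run: show what would be sent for a failing custom check.
print(health_report("i-123abc45d", checks_passed=False))
```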
c) Instance Warmup
With step scaling policies, you can specify the number of seconds that
it takes for a newly launched instance to warm up. Until its specified
warm-up time has expired, an instance is not counted toward the
aggregated metrics of the Auto Scaling group. While scaling out, AWS
also does not consider instances that are warming up as part of the
current capacity of the group. Therefore, multiple alarm breaches that
fall in the range of the same step adjustment result in a single
scaling activity. This ensures that we don't add more instances than
you need.
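For reference, a step scaling policy with a warm-up could be created roughly like this from Python. It is a sketch under assumptions: boto3 is installed, and the policy name, the single step and the 300 seconds are example values rather than recommendations. The policy still has to be attached to a CloudWatch alarm separately.

```python
def step_policy_params(group_name, warmup_seconds):
    """Arguments for put_scaling_policy: add one instance per alarm breach."""
    return {
        "AutoScalingGroupName": group_name,
        "PolicyName": "scale-out-on-high-cpu",  # hypothetical name
        "PolicyType": "StepScaling",
        "AdjustmentType": "ChangeInCapacity",
        "StepAdjustments": [
            # Any breach at or above the alarm threshold adds one instance
            {"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 1},
        ],
        # Seconds before a new instance counts toward the group metrics
        "EstimatedInstanceWarmup": warmup_seconds,
    }


def apply_policy(params):
    """Create or update the policy (needs boto3 and AWS credentials)."""
    import boto3

    return boto3.client("autoscaling").put_scaling_policy(**params)


params = step_policy_params("your-group-name", warmup_seconds=300)
print(params)
```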
Again, the second step is not that big a deal: you can control the flow in your script and start the application at the end, so that the instance is only then marked healthy.
You can also try as-enter-exit-standby, but I think custom health checks for warm-up can do this job.

AWS Spot/OnDemand Instance Management

Is there an elegant way to script or configure Spot instance requests so that, if Spot capacity is not available within some specified duration, On-Demand is used instead? And if a Spot instance gets terminated, just shift to On-Demand.
Spot Fleet does not do this (it manages only Spot); EMR fleets have some logic around this. You can have Auto Scaling with Spot or On-Demand, not both (although two separate ASGs can simulate this behavior).
This should be some kind of baseline use case.
Also, does an event get triggered when a Spot instance is launched or terminated? I am only seeing CLI commands to check Spot status, not any CloudWatch metric or event.
CloudWatch Instance State events fire whenever an instance changes state.
They can fire for any state in the lifecycle of an instance:
pending (launching), running (launch complete), shutting-down, stopped, stopping, and terminated, for any instance (or for all instances, which is probably what you want; just disregard any instance that isn't of interest), and this includes both On-Demand and Spot.
http://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#ec2_event_type
http://docs.aws.amazon.com/AmazonCloudWatch/latest/events/LogEC2InstanceState.html
You could use this to roll your own solution -- there's not a built in mechanism for marshaling mixed fleets.
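If you roll your own, the starting point is an event rule for instance state changes. Below is a hedged Python sketch: the event pattern follows the documented "EC2 Instance State-change Notification" format, put_rule is the boto3 call that creates the rule, and the tiny matcher is only my own stand-in for trying the pattern locally (the real matching is done by the service).

```python
import json


def instance_state_pattern(states):
    """Event pattern matching EC2 instances entering any of `states`."""
    return {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
        "detail": {"state": list(states)},
    }


def matches(pattern, event):
    """Very small local matcher covering only the fields used above."""
    return (
        event.get("source") in pattern["source"]
        and event.get("detail-type") in pattern["detail-type"]
        and event.get("detail", {}).get("state") in pattern["detail"]["state"]
    )


def create_rule(rule_name, pattern):
    """Create the rule (needs boto3 and credentials); targets are added separately."""
    import boto3

    return boto3.client("events").put_rule(
        Name=rule_name, EventPattern=json.dumps(pattern), State="ENABLED"
    )


pattern = instance_state_pattern(["pending", "terminated"])
sample_event = {  # hand-written event in the documented shape
    "source": "aws.ec2",
    "detail-type": "EC2 Instance State-change Notification",
    "detail": {"instance-id": "i-0abc", "state": "terminated"},
}
print(matches(pattern, sample_event))
```

The rule's target (for example a Lambda function) would then decide whether the instance is a Spot instance of interest and react accordingly.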
I used to do this from the ELB with health checks. You can make two groups, one with Spot instances and one with Reserved or On-Demand. Create a CloudWatch alarm for when the Spot group contains zero healthy hosts, and scale up the other group when it fires. And the other way around: when the Spot group has enough healthy hosts, scale down the other group. Use 30-second health checks on the alarm you use to scale up and a 30-60 minute cooldown on scale-down.
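The decision behind the two-group setup can be written down in a few lines of Python. Note that this is a variant of the alarm-based approach above, not the same thing: instead of switching on "zero healthy hosts", it sizes the On-Demand group to cover whatever the Spot group is missing. The function name and all numbers are assumptions for illustration.

```python
def on_demand_desired(target_capacity, healthy_spot, max_on_demand):
    """On-Demand instances needed to cover what Spot is not providing."""
    shortfall = max(0, target_capacity - healthy_spot)
    return min(shortfall, max_on_demand)


def apply_desired(group_name, desired):
    """Set the On-Demand group's size (needs boto3 and AWS credentials)."""
    import boto3

    boto3.client("autoscaling").set_desired_capacity(
        AutoScalingGroupName=group_name,
        DesiredCapacity=desired,
        HonorCooldown=True,  # respect the group's cooldown, as suggested for scale-down
    )


# Spot lost entirely, partly available, and fully healthy again.
for healthy in (0, 3, 4):
    print(healthy, on_demand_desired(target_capacity=4, healthy_spot=healthy, max_on_demand=4))
```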
There is also SpotML, which lets you always keep either a Spot instance or an On-Demand instance up and running.
In addition to simply spawning the instance, it also allows you to:
Preserve data via persistent storage
Configure a startup script that runs each time a new instance is spawned
Disclosure: I'm also the creator of SpotML, it's primarily useful for ML/DataScience workflows that can largely just run on spot instances.

Automatically terminate Auto Scaling instances after a time period

We use Amazon EC2 Auto Scaling groups for a bunch of apps - as everyone knows, while you try your hardest not to have memory leaks and other "been up for a while problems" - it's still possible.
We'd like to protect against such possibilities by just bouncing the servers - ie make sure an instance only lasts, say, 24 hours before killing it. However, we want the killing to be "safe" - eg - even if there's only one instance in the group, we want to start up another instance to a working state, then kill the old box.
Is there any support for this? eg a time-to-live property on an instance?
There is no such property in Amazon EC2 nor in Auto Scaling.
You could manually set the instance health to Unhealthy, which would cause Auto Scaling to terminate and replace the instance. However, if you have only one instance then there will likely be a period where there are no instances.
You could set the Auto Scaling termination policy to OldestInstance, which means that when Auto scaling needs to terminate an instance, it will terminate the oldest instance within the AZ that has the most instances. This gets rid of old instances, but only when the group is scaled-in.
Therefore, you could supplement the Termination Policy with a script that scales-out the group and then causes it to scale-in again. For example, double the number of instances, wait for them to launch, and then halve the number of instances. This should cause them all to refresh (with a few edge conditions if your instances are spread across multiple AZs, causing non-even counts).
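A rough Python sketch of that double-then-halve script follows, assuming boto3 and credentials, a termination policy of OldestInstance already set on the group, and a MaxSize large enough to hold twice the current capacity. The fixed sleep is a simplification; a real script would poll until the new instances are InService.

```python
import time


def refresh_plan(current_desired, max_size):
    """Return (scale_out_to, scale_in_to) for a double-then-halve refresh."""
    scale_out_to = current_desired * 2
    if scale_out_to > max_size:
        raise ValueError("MaxSize is too small to double the group")
    return scale_out_to, current_desired


def refresh_group(group_name, wait_seconds=600):
    """Double the group, wait, then halve it (needs boto3 and credentials)."""
    import boto3

    client = boto3.client("autoscaling")
    group = client.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )["AutoScalingGroups"][0]
    out_to, in_to = refresh_plan(group["DesiredCapacity"], group["MaxSize"])

    client.set_desired_capacity(
        AutoScalingGroupName=group_name, DesiredCapacity=out_to, HonorCooldown=False
    )
    time.sleep(wait_seconds)  # crude wait for the new instances to come up
    client.set_desired_capacity(
        AutoScalingGroupName=group_name, DesiredCapacity=in_to, HonorCooldown=False
    )


print(refresh_plan(current_desired=3, max_size=6))
```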
Another option is to restart the instance(s). This will not cause them to appear unhealthy to Auto Scaling, but they will appear unhealthy to a Load Balancer. (If you have activated ELB Health Checks within Auto Scaling, then Auto Scaling would actually terminate instances that fail the health check.) You can use Scheduled Events for Your Instances to have Amazon CloudWatch Events restart your instance(s) at certain intervals, or even have a script on the instance tell the Operating System to restart at certain intervals.
However, there is no automatic option to do exactly what you asked.
Since 2019, there has been a Maximum Instance Lifetime parameter that almost does what you wanted.
When this answer was written, it wasn't possible to set the maximum instance lifetime to 24 hours (86400 seconds), because the minimum was a week:
Maximum instance lifetime must be equal to 0, between 604800 and 31536000 seconds (inclusive), or not specified.
AWS has since lowered the minimum to 86400 seconds (1 day).

AWS Is it possible to automatically terminate and recreate new instances for an auto scaling group periodically?

We have an AWS scaling group that has 10-20 servers behind a load balancer. After running for a couple of weeks, some of these servers go bad. We have no idea why the servers go bad, and it will take some time for us to get to a stage where we can debug this issue.
In the interim is there a way to tell AWS to terminate all the instances in the scaling group in a controlled fashion (one by one) until all the instances are replaced by new ones every week or so?
You can achieve this very effectively using Data Pipeline.
This is the developer guide for How do I stop and start Amazon EC2 Instances at scheduled intervals with AWS Data Pipeline?
There is no function in Auto Scaling to tell it to automatically terminate and replace instances. However, you could script such functionality.
Assumptions:
Terminate instances that are older than a certain number of hours old
Do them one-at-a-time to avoid impacting available capacity
You wish to replace them immediately
A suitable script would do the following:
Loop through all instances in a given Auto-Scaling Group using describe-auto-scaling-instances
If the instance belongs to the desired Auto Scaling group, retrieve its launch time via describe-instances
If the instance is older than the desired number of hours, terminate it using terminate-instance-in-auto-scaling-group with --no-should-decrement-desired-capacity so that it is automatically replaced
Then, wait a few minutes to allow it to be replaced and continue the loop
The script could be created by using the AWS Command-Line Interface (CLI) or a programming language such as Python.
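The steps above can be sketched in Python as follows. This is a minimal version under assumptions: boto3 and credentials are available, pagination is ignored, the fixed pause stands in for a proper "wait until replaced" check, and it lists the group's instances via describe_auto_scaling_groups rather than describe-auto-scaling-instances, which is a small simplification of the steps as written.

```python
import time
from datetime import datetime, timedelta, timezone


def expired_instance_ids(launch_times, now, max_age_hours):
    """IDs of instances older than max_age_hours, oldest first.

    `launch_times` maps instance ID -> timezone-aware launch datetime.
    """
    limit = timedelta(hours=max_age_hours)
    old = [(t, i) for i, t in launch_times.items() if now - t > limit]
    return [i for _, i in sorted(old)]


def replace_old_instances(group_name, max_age_hours, pause_seconds=300):
    """Terminate old instances one at a time (needs boto3 and credentials)."""
    import boto3

    asg = boto3.client("autoscaling")
    ec2 = boto3.client("ec2")
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )["AutoScalingGroups"][0]
    ids = [i["InstanceId"] for i in group["Instances"]]
    if not ids:
        return

    launch_times = {}
    for reservation in ec2.describe_instances(InstanceIds=ids)["Reservations"]:
        for inst in reservation["Instances"]:
            launch_times[inst["InstanceId"]] = inst["LaunchTime"]

    now = datetime.now(timezone.utc)
    for instance_id in expired_instance_ids(launch_times, now, max_age_hours):
        # Keep the desired capacity so Auto Scaling launches a replacement
        asg.terminate_instance_in_auto_scaling_group(
            InstanceId=instance_id, ShouldDecrementDesiredCapacity=False
        )
        time.sleep(pause_seconds)  # give the replacement time to come up


# Dry run of the selection logic with made-up launch times.
now = datetime(2020, 1, 8, tzinfo=timezone.utc)
sample = {
    "i-a": now - timedelta(hours=100),
    "i-b": now - timedelta(hours=10),
    "i-c": now - timedelta(hours=80),
}
print(expired_instance_ids(sample, now, max_age_hours=72))
```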
Alternatively, you could program the instances to self-destruct after a given period of time (eg 72 hours) by simply calling the operating system to shut-down the instance. This would cause auto-scaling to terminate the instance and replace it.
There are two ways to achieve what you are looking for: Scheduled Auto Scaling Actions, or taking instances out of the ASG one at a time by putting them into Standby.
Scheduled Scaling
Scaling based on a schedule allows you to scale your application in response to predictable load changes. For example, every week the traffic to your web application starts to increase on Wednesday, remains high on Thursday, and starts to decrease on Friday. You can plan your scaling activities based on the predictable traffic patterns of your web application.
https://docs.aws.amazon.com/autoscaling/latest/userguide/schedule_time.html
You most likely want this.
Auto Scaling enables you to put an instance that is in the InService state into the Standby state, update or troubleshoot the instance, and then return the instance to service. Instances that are on standby are still part of the Auto Scaling group, but they do not actively handle application traffic.
https://docs.aws.amazon.com/autoscaling/latest/userguide/as-enter-exit-standby.html
As of Nov 20, 2019, EC2 AutoScaling supports Max Instance Lifetime: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-ec2-auto-scaling-supports-max-instance-lifetime/
From:
The maximum instance lifetime specifies the maximum amount of time (in
seconds) that an instance can be in service. The maximum duration
applies to all current and future instances in the group. As an
instance approaches its maximum duration, it is terminated and
replaced, and cannot be used again.
When configuring the maximum instance lifetime for your Auto Scaling
group, you must specify a value of at least 86,400 seconds (1 day). To
clear a previously set value, specify a new value of 0.
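Setting the value from Python might look like the sketch below. It assumes boto3 and credentials; update_auto_scaling_group with MaxInstanceLifetime is the real boto3 call, while the validation helper simply mirrors the rule quoted above and its name is my own.

```python
ONE_DAY_SECONDS = 86_400


def validate_max_instance_lifetime(seconds):
    """Accept 0 (clears the setting) or at least one day, per the quoted rule."""
    if seconds == 0:
        return seconds
    if seconds < ONE_DAY_SECONDS:
        raise ValueError("maximum instance lifetime must be 0 or >= 86400 seconds")
    return seconds


def set_max_instance_lifetime(group_name, seconds):
    """Apply the setting to a group (needs boto3 and AWS credentials)."""
    import boto3

    boto3.client("autoscaling").update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MaxInstanceLifetime=validate_max_instance_lifetime(seconds),
    )


# One week, expressed in seconds.
print(validate_max_instance_lifetime(7 * ONE_DAY_SECONDS))
```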