AWS Spot/OnDemand Instance Management


Is there an elegant way to script or configure Spot instance requests so that, if Spot capacity is not available within a specified duration, On-Demand is used instead? And if a Spot instance gets terminated, the workload should just shift to On-Demand.
Spot Fleet does not do this (it manages only Spot); EMR instance fleets have some logic around this. You can have Auto Scaling with Spot or On-Demand, not both (although two separate ASGs can simulate this behavior).
This seems like a baseline use case.
Also, is an event triggered when a Spot instance is launched or terminated? I am only seeing CLI commands to check Spot status, not any CloudWatch metric or event.

CloudWatch Events can fire an EC2 Instance State-change Notification whenever an instance changes state.
These events cover every state in the lifecycle of an instance:
pending (launching), running (launch complete), stopping, stopped, shutting-down, and terminated. A rule can match a single instance or all instances (which is probably what you want -- just disregard any instance that isn't of interest), and this includes both On-Demand and Spot.
http://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#ec2_event_type
http://docs.aws.amazon.com/AmazonCloudWatch/latest/events/LogEC2InstanceState.html
You could use this to roll your own solution -- there's not a built in mechanism for marshaling mixed fleets.
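As a rough starting point for rolling your own, here is a minimal sketch of an event pattern for these notifications plus a Lambda-style handler that ignores states you don't care about. The function names and the choice of "running" and "terminated" as the interesting states are my own assumptions, not anything AWS prescribes; the pattern fields follow the documented event format linked above.

```python
import json


def build_state_change_pattern(states):
    """Event pattern matching EC2 state-change notifications for the given states."""
    return {
        "source": ["aws.ec2"],
        "detail-type": ["EC2 Instance State-change Notification"],
        "detail": {"state": list(states)},
    }


def handler(event, context=None):
    """Return (instance_id, state) for events of interest, otherwise None.

    The set of interesting states below is an assumption for illustration.
    """
    if event.get("source") != "aws.ec2":
        return None
    detail = event.get("detail", {})
    state = detail.get("state")
    if state not in ("running", "terminated"):
        return None
    return detail.get("instance-id"), state


if __name__ == "__main__":
    # Print the pattern so it can be pasted into a rule definition.
    print(json.dumps(build_state_change_pattern(["running", "terminated"]), indent=2))
```

The handler only reports what happened; deciding whether the instance was Spot or On-Demand, and what to launch in response, is left to your own logic.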

I used to do this from the ELB with health checks. You can make two groups, one with Spot instances and one with Reserved or On-Demand instances. Create a CloudWatch alarm that fires when the Spot group contains zero healthy hosts, and scale up the other group when it fires. Do the reverse as well: when the Spot group has enough healthy hosts again, scale down the other group. Use 30-second health checks on the alarm that scales up, and a 30-60 minute cooldown on the scale-down.
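If you would rather drive the second group from a script than from alarms and scaling policies, the same idea can be sketched as below. This is a variation on the setup described above, not the alarm-based setup itself. It assumes a boto3-style Auto Scaling client is passed in (for example `boto3.client("autoscaling")`); the group name, the function names, and the "target total" notion are hypothetical.

```python
def desired_on_demand_capacity(healthy_spot, target_total):
    """On-demand instances needed to cover whatever the Spot group cannot provide."""
    return max(0, target_total - healthy_spot)


def rebalance(autoscaling, on_demand_group, healthy_spot, target_total):
    """Set the on-demand group's desired capacity to cover the Spot shortfall.

    `autoscaling` is assumed to be a boto3-style Auto Scaling client.
    The value must lie within the group's MinSize/MaxSize or the call fails.
    """
    capacity = desired_on_demand_capacity(healthy_spot, target_total)
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=on_demand_group,
        DesiredCapacity=capacity,
    )
    return capacity
```

Calling this too eagerly would make the on-demand group flap, which is why the long cooldown on scale-down matters in the alarm-based version.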

There is also SpotML, which lets you always keep a Spot instance or an On-Demand instance up and running.
In addition to simply spawning the instance, it also lets you:
Preserve data via persistent storage
Configure a startup script that runs each time a new instance is spawned
Disclosure: I'm the creator of SpotML. It's primarily useful for ML/data science workflows that can largely just run on Spot instances.

Related

AWS autoscale group scale in event

I am using an Auto Scaling group to add and remove additional instances for my application. I am using CPU utilization as my scaling metric and am wondering what happens when an instance is running a program and the CPU utilization drops below 65% (i.e., the threshold value).
Does it wait for the instance to finish the program or terminate the instance at that moment? If it terminates the instance at that moment then it might lead to data loss/data inconsistency.
Any help would be appreciated.

If you're looking to prevent or delay the termination of an instance during a scale-in event, take a look at lifecycle hooks.
With a lifecycle hook enabled, Auto Scaling can send a notification that a specific instance action is about to occur (scale out or scale in). Using a combination of services (such as SNS, Lambda, SSM, etc.), you can programmatically notify the instance that it is about to be terminated so that it can take any necessary actions.
The instance termination waits until the Auto Scaling group receives confirmation that those actions have completed, and then the instance is terminated. A lifecycle hook also has a timeout: if no confirmation is received before the timeout expires, the termination still occurs.
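The confirmation step might look roughly like this. It is a sketch that assumes a boto3-style Auto Scaling client is passed in; the function name and the `drain` callback (whatever cleanup your application needs) are hypothetical.

```python
def drain_then_continue(autoscaling, group_name, hook_name, instance_id, drain):
    """Run the cleanup callback, then confirm the lifecycle action.

    `autoscaling` is assumed to be a boto3-style client, e.g.
    boto3.client("autoscaling"). The confirmation is sent even if drain()
    raises, so the instance does not sit waiting for the hook timeout.
    """
    try:
        drain()
    finally:
        autoscaling.complete_lifecycle_action(
            LifecycleHookName=hook_name,
            AutoScalingGroupName=group_name,
            LifecycleActionResult="CONTINUE",
            InstanceId=instance_id,
        )
```

If the cleanup can outlast the hook's timeout, the client's `record_lifecycle_action_heartbeat` call can be used to extend it while work is still in progress.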

I think that you are looking for a termination policy.
Look at this link:
https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-instance-termination.html#default-termination-policy
In my experience, the instance will be terminated no matter what it is running.

Does it wait for the instance to finish the program or terminate the instance at that moment?
Sadly, it does not wait. The ASG works outside of your instances and is not concerned with any programs running on them.
Having said that, there are a few things you can do, some of which are described in:
Controlling which Auto Scaling instances terminate during scale in
Generally speaking, you should develop your applications to be stateless. This means the applications should be "aware" that they can be terminated at any time. One way to achieve this is to use external storage systems, such as S3 or EFS, which persist data between terminations.
Another way is to use instance scale-in protection. (Plain EC2 termination protection does not stop Auto Scaling from terminating an instance.) In this case, the application puts its instance in the protected state at the beginning of processing, and when the calculation finishes, the protection is removed.
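The protect-while-working idea can be sketched with Auto Scaling's instance scale-in protection. This assumes a boto3-style Auto Scaling client is passed in; the context manager name is my own, and the instance would need to discover its own ID and group name somehow (not shown).

```python
from contextlib import contextmanager


@contextmanager
def scale_in_protection(autoscaling, group_name, instance_id):
    """Protect an instance from scale-in for the duration of a block of work.

    `autoscaling` is assumed to be a boto3-style client, e.g.
    boto3.client("autoscaling").
    """
    def set_protected(flag):
        autoscaling.set_instance_protection(
            InstanceIds=[instance_id],
            AutoScalingGroupName=group_name,
            ProtectedFromScaleIn=flag,
        )

    set_protected(True)
    try:
        yield
    finally:
        # Always lift the protection, even if the work failed.
        set_protected(False)
```

Usage would be along the lines of `with scale_in_protection(client, "my-asg", my_instance_id): run_calculation()`, where `run_calculation` stands in for your own processing.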

Does an autoscaling group that contains spot instances respond to spot instance interruption notices?

I'm considering using spot instances in an auto-scaling group. While I'm aware that I'll receive a 'Spot instance interruption notice' if my spot instances are going to get terminated, what is unclear from the documentation is if my auto-scaling group will spin up new on-demand instances to replace these when the notice occurs, or if they only get replaced on termination. I'm aware that I could listen for these notices manually, but it seems like something that an auto-scaling group should be able to handle automatically.
I've tried testing this out on an existing auto-scaling group that had spot instances by changing the launch configuration's 'spot price' to be lower than the current price. This did not work, as it would only affect new instances and not currently running ones. I'm unsure of how to change an existing spot request's price.
What I'm hoping will happen is that on-demand instances will be spun up in the two minutes I have from the interruption notice till the time of termination.

If the Launch Configuration in your Auto Scaling group is configured to use Spot instances, then the replacement instance will also be a Spot instance, not an On-Demand one.
The situation you describe is one of the challenges of using Spot instances: although the cost is very low, a Spot instance can be terminated at any time so that the underlying resources can be used to fulfill an On-Demand or Reserved Instance request.
One way to avoid this is to use Reserved Instances. If you have a predictable long-term need for an instance, or are running a production workload, using Reserved Instances is an effective way to lower your costs (albeit not as low as a Spot instance) without having to worry that you could lose your instance(s) at any time.
Regarding changing the price, updates to pricing are applied to new instances only. After updating the price, simply terminate your existing instances and your ASG will replace them with instances at the new price.
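For the manual route mentioned in the question, the interruption notice is exposed on the instance at the metadata path `http://169.254.169.254/latest/meta-data/spot/instance-action`, which returns a small JSON document once a notice has been issued and a 404 otherwise. Below is a minimal sketch of parsing that document; the fetching itself is omitted, and the function name is hypothetical.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(body, now):
    """Parse a spot/instance-action document.

    Returns (action, seconds_remaining). `body` is the JSON text returned by
    the metadata endpoint; `now` is a timezone-aware UTC datetime.
    """
    notice = json.loads(body)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    return notice["action"], (when - now).total_seconds()


if __name__ == "__main__":
    sample = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
    print(seconds_until_interruption(sample, datetime.now(timezone.utc)))
```

What you do with the remaining time (for example, bumping the desired capacity of a separate on-demand group) is up to you; two minutes is tight for launching and warming up a replacement.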

Automatically terminate Auto Scaling instances after a time period

We use Amazon EC2 Auto Scaling groups for a bunch of apps - as everyone knows, while you try your hardest not to have memory leaks and other "been up for a while problems" - it's still possible.
We'd like to protect against such possibilities by just bouncing the servers - ie make sure an instance only lasts, say, 24 hours before killing it. However, we want the killing to be "safe" - eg - even if there's only one instance in the group, we want to start up another instance to a working state, then kill the old box.
Is there any support for this? eg a time-to-live property on an instance?

There is no such property in Amazon EC2 nor in Auto Scaling.
You could manually set the instance health to Unhealthy, which would cause Auto Scaling to terminate and replace the instance. However, if you have only one instance then there will likely be a period where there are no instances.
You could set the Auto Scaling termination policy to OldestInstance, which means that when Auto Scaling needs to terminate an instance, it will terminate the oldest instance within the AZ that has the most instances. This gets rid of old instances, but only when the group is scaled in.
Therefore, you could supplement the termination policy with a script that scales out the group and then causes it to scale in again. For example, double the number of instances, wait for them to launch, and then halve the number of instances. This should cause them all to refresh (with a few edge conditions if your instances are spread across multiple AZs, causing uneven counts).
Another option is to restart the instance(s). This will not cause them to appear unhealthy to Auto Scaling, but they will appear unhealthy to a Load Balancer. (If you have activated ELB Health Checks within Auto Scaling, then Auto Scaling would actually terminate instances that fail the health check.) You can use Scheduled Events for Your Instances to have Amazon CloudWatch Events restart your instance(s) at certain intervals, or even have a script on the instance tell the operating system to restart at certain intervals.
However, there is no automatic option to do exactly what you asked.

Since 2019, there has been a Maximum Instance Lifetime parameter, that almost does what you wanted.
Unfortunately, though, it isn’t possible to set the maximum instance lifetime to 24 hours (86400 seconds): the minimum is a week.
Maximum instance lifetime must be equal to 0, between 604800 and 31536000 seconds (inclusive), or not specified.

AWS - how to prevent load balancer from terminating instances under load?

I'm writing a web-service that packs up customer data into zip-files, then uploads them to S3 for download. It is an on-demand process, and the amount of data can range from a few Megabytes to multiple Gigabytes, depending on what data the customer orders.
Needless to say, scalability is essential for such a service. But I'm having trouble with it. Packaging the data into zip-files has to be done on the local hard drive of a server instance.
But the load balancer is prone to terminating instances that are still working. I have taken a look at scaling policies:
http://docs.aws.amazon.com/autoscaling/latest/userguide/as-instance-termination.html
But what I need doesn't seem to be there. The issue shouldn't be so difficult: I set the scale metric to CPU load, and scale down when it goes under 1%. But I need a guarantee that the exact instance will be terminated that breached the threshold, not another one that's still hard at work, and the available policies don't seem to present me with that option. Right now, I am at a loss how to achieve this. Can anybody give me some advice?

You can use Auto Scaling Lifecycle Hooks to perform actions before an instance is terminated. You could use this to wait for the processing to finish before proceeding with the instance termination.

It appears that you have configured an Auto Scaling group with scaling policies based upon CPU Utilization.
Please note that an Elastic Load Balancer will never terminate an Amazon EC2 instance -- if a Load Balancer health check fails, it will merely stop serving traffic to that EC2 instance until it again passes the health checks. It is possible to configure Auto Scaling to use ELB health checks, in which case Auto Scaling will terminate any instances that ELB marks as unhealthy.
Therefore, it would appear that Auto Scaling is responsible for terminating your instances, as a result of your scaling policies. You say that you wish to terminate specific instances that are unused. However, this is not the general intention of Auto Scaling. Rather, Auto Scaling is used to provide a pool of resources that can be scaled by launching new instances and terminating unwanted instances. Metrics that trigger Auto Scaling are typically based upon aggregate metrics across the whole Auto Scaling group (eg average CPU Utilization).
Given that Amazon EC2 instances are charged by the hour, it is often a good idea to keep instances running longer -- "Scale Out quickly, Scale In slowly".
Once Auto Scaling decides to terminate an instance (which it selects via a termination policy), use an Auto Scaling lifecycle hook to delay the termination until ready (eg, copying log files to S3, or waiting for a long process to complete).
If you do wish to terminate an instance after it has completed a particular workload, there is no need to use Auto Scaling -- just have the instance shut down when it is finished, and set the Shutdown Behavior to terminate so that the instance is automatically terminated upon shutdown. (This assumes that you have a process to launch new instances when you have work you wish to perform.)
Stepping back and looking at your total architecture, it would appear that you have a Load Balancer in front of web servers, and you are performing the Zip operations on the web servers? This is not a scalable solution. It would be better if your web servers pushed a message into an Amazon Simple Queue Service (SQS) queue, and then your fleet of back-end servers processed messages from the queue. This way, your front-end can continue receiving requests regardless of the amount of processing underway.
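The queue-based split could be sketched as follows. This assumes a boto3-style SQS client is passed in (for example `boto3.client("sqs")`); the message shape (`order_id`), the function names, and the `build_zip` callback are all hypothetical.

```python
import json


def enqueue_zip_job(sqs, queue_url, order_id):
    """Front-end side: record the request and return immediately."""
    sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps({"order_id": order_id}))


def process_one_batch(sqs, queue_url, build_zip):
    """Back-end side: pull one message, do the work, then delete the message.

    Returns the number of jobs handled. A message is deleted only after
    build_zip succeeds, so a worker that dies mid-job leaves the message to
    reappear on the queue once its visibility timeout expires.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # long polling
    )
    handled = 0
    for message in response.get("Messages", []):
        job = json.loads(message["Body"])
        build_zip(job["order_id"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        handled += 1
    return handled
```

With this split, losing a worker during scale-in costs you a retry rather than a customer's order, provided the queue's visibility timeout is longer than the slowest zip job.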

It sounds like what you need is Instance Protection, which is actually mentioned a bit more towards the bottom of the document that you linked to. As long as you have work being performed on a particular instance, it should not be automatically terminated by the Auto-Scaling Group (ASG).
Check out this blog post, on the official AWS blog, that conceptually talks about how you can use Instance Protection to prevent work from being prematurely terminated.

AWS Is it possible to automatically terminate and recreate new instances for an auto scaling group periodically?

We have an AWS Auto Scaling group that has 10-20 servers behind a load balancer. After running for a couple of weeks, some of these servers go bad. We have no idea why the servers go bad, and it will take some time for us to get to a stage where we can debug this issue.
In the interim is there a way to tell AWS to terminate all the instances in the scaling group in a controlled fashion (one by one) until all the instances are replaced by new ones every week or so?

You can achieve this very effectively using Data Pipeline.
This is the developer guide for How do I stop and start Amazon EC2 Instances at scheduled intervals with AWS Data Pipeline?

There is no function in Auto Scaling to tell it to automatically terminate and replace instances. However, you could script such functionality.
Assumptions:
Terminate instances that are older than a certain number of hours
Do them one-at-a-time to avoid impacting available capacity
You wish to replace them immediately
A suitable script would do the following:
Loop through all instances in a given Auto-Scaling Group using describe-auto-scaling-instances
If the instance belongs to the desired Auto Scaling group, retrieve its launch time via describe-instances
If the instance is older than the desired number of hours, terminate it using terminate-instance-in-auto-scaling-group with --no-should-decrement-desired-capacity so that it is automatically replaced
Then, wait a few minutes to allow it to be replaced and continue the loop
The script could be created by using the AWS Command-Line Interface (CLI) or a programming language such as Python.
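A minimal Python sketch of those steps, assuming boto3-style Auto Scaling and EC2 clients are passed in (for example `boto3.client("autoscaling")` and `boto3.client("ec2")`). The function names and the default pause are my own choices, and pagination of `describe_auto_scaling_instances` is left out for brevity.

```python
import time
from datetime import datetime, timedelta, timezone


def instances_in_group(autoscaling, group_name):
    """IDs of the instances that belong to the given Auto Scaling group."""
    response = autoscaling.describe_auto_scaling_instances()
    return [
        item["InstanceId"]
        for item in response["AutoScalingInstances"]
        if item["AutoScalingGroupName"] == group_name
    ]


def launch_times(ec2, instance_ids):
    """Map instance ID to launch time (timezone-aware datetime)."""
    if not instance_ids:
        return {}
    response = ec2.describe_instances(InstanceIds=instance_ids)
    return {
        instance["InstanceId"]: instance["LaunchTime"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    }


def replace_old_instances(autoscaling, ec2, group_name, max_age_hours,
                          pause_seconds=300, now=None, sleep=time.sleep):
    """Terminate instances older than max_age_hours, one at a time, oldest first."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(hours=max_age_hours)
    ages = launch_times(ec2, instances_in_group(autoscaling, group_name))
    replaced = []
    for instance_id, launched in sorted(ages.items(), key=lambda kv: kv[1]):
        if launched < cutoff:
            # Keep desired capacity unchanged so the group launches a replacement.
            autoscaling.terminate_instance_in_auto_scaling_group(
                InstanceId=instance_id,
                ShouldDecrementDesiredCapacity=False,
            )
            replaced.append(instance_id)
            sleep(pause_seconds)  # give the replacement time to come up
    return replaced
```

A fixed pause is the crude part here; a more careful version would poll the group until the replacement is InService before moving on.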
Alternatively, you could program the instances to self-destruct after a given period of time (eg 72 hours) by simply calling the operating system to shut-down the instance. This would cause auto-scaling to terminate the instance and replace it.

There are two ways to achieve what you are looking for: Scheduled Auto Scaling actions, or temporarily taking instances out of service in the ASG by putting them on standby.
Scheduled Scaling
Scaling based on a schedule allows you to scale your application in response to predictable load changes. For example, every week the traffic to your web application starts to increase on Wednesday, remains high on Thursday, and starts to decrease on Friday. You can plan your scaling activities based on the predictable traffic patterns of your web application.
https://docs.aws.amazon.com/autoscaling/latest/userguide/schedule_time.html
You most likely want this.
Standby
Auto Scaling enables you to put an instance that is in the InService state into the Standby state, update or troubleshoot the instance, and then return the instance to service. Instances that are on standby are still part of the Auto Scaling group, but they do not actively handle application traffic.
https://docs.aws.amazon.com/autoscaling/latest/userguide/as-enter-exit-standby.html

As of Nov 20, 2019, EC2 AutoScaling supports Max Instance Lifetime: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-ec2-auto-scaling-supports-max-instance-lifetime/
From:
The maximum instance lifetime specifies the maximum amount of time (in
seconds) that an instance can be in service. The maximum duration
applies to all current and future instances in the group. As an
instance approaches its maximum duration, it is terminated and
replaced, and cannot be used again.
When configuring the maximum instance lifetime for your Auto Scaling
group, you must specify a value of at least 86,400 seconds (1 day). To
clear a previously set value, specify a new value of 0.
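Setting the value from a script could look like the sketch below. It assumes a boto3-style Auto Scaling client is passed in; the function name is hypothetical. The minimum is kept as a parameter, defaulting to the 86,400 seconds quoted above, because another answer on this topic quotes an older limit of 604800 seconds.

```python
def set_max_instance_lifetime(autoscaling, group_name, seconds, minimum=86400):
    """Set (or clear, with 0) the maximum instance lifetime on a group.

    `autoscaling` is assumed to be a boto3-style client, e.g.
    boto3.client("autoscaling"). `minimum` mirrors the limit quoted in the
    text above and is only a local sanity check; the service enforces its own.
    """
    if seconds != 0 and seconds < minimum:
        raise ValueError(
            f"maximum instance lifetime must be 0 or at least {minimum} seconds"
        )
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MaxInstanceLifetime=seconds,
    )
```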
