Does ELB connection draining apply when spot instances are terminated? - amazon-web-services

A new AWS ELB feature, connection draining, was recently announced.
http://aws.amazon.com/about-aws/whats-new/2014/03/20/elastic-load-balancing-supports-connection-draining/
Apparently this works with Auto Scaling Groups - instances are drained before being removed, but does that also apply to spot instances that are being terminated by AWS due to a rising spot price?

Nothing definitive I could find, but from my reading on this, I think the answer is almost certainly no. Spot instances are different animals from regular instances, and with the way connection draining works you can specify up to a 60-minute delay before a connection-draining-enabled instance gets terminated once it becomes unhealthy. If AWS were to extend this added layer of safety to spot instances, it would completely up-end the way spot instances are used and how they are positioned.
The trade-off for using spot instances has always been, "you can pay a fraction of the cost, but you risk being terminated at any instant without warning". If they added an up-to-60-minute 'warning' to spot instances, while it would be fantastic from the end user's point of view, I think it would severely eat into AWS's on-demand and reserved instance pricing model, and thus they probably won't support this anytime soon (unless forced to by competitive pressure).
EDIT 1/6/2015: Now, almost a year later, AWS has indeed added a 'two-minute warning' for EC2 spot instance termination. https://aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/
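If you want to act on that warning yourself, a minimal sketch is to poll the spot termination-time metadata endpoint. This assumes the code runs on the spot instance itself, that IMDSv1 is enabled (IMDSv2 would additionally require a session token), and the drain action is left as a placeholder:

```python
# Minimal sketch: poll the spot termination notice from instance metadata.
# Assumptions: runs on the spot instance, IMDSv1 enabled; drain action is a placeholder.
import time
import urllib.request
import urllib.error

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

def termination_scheduled():
    """Return True once AWS has scheduled this spot instance for termination."""
    try:
        with urllib.request.urlopen(METADATA_URL, timeout=2) as resp:
            # A 200 response carries the scheduled termination timestamp.
            return resp.status == 200
    except urllib.error.HTTPError:
        # 404 means no termination is currently scheduled.
        return False

while not termination_scheduled():
    time.sleep(5)  # polling roughly every 5 seconds is sufficient

# At this point you have roughly two minutes: deregister from the ELB,
# finish in-flight work, flush state, etc.
print("Spot termination notice received -- draining now")
```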

Related

aws spot instances are evicted much sooner and more often than expected

I am trying to use AWS spot instances (m5.large in the eu-west-2 region) with a maximum bid equal to the price of on-demand instances. According to https://aws.amazon.com/ec2/spot/instance-advisor/ these instances should have a < 5% frequency of interruption; however, after launching 40 such instances this morning, I found that within the hour 34 of them were evicted by AWS ("instance-terminated-no-capacity" according to the spot requests page on the EC2 dashboard).
This eviction rate looks much too high compared to both Amazon's own advisor and other users' experiences. Does anybody know what could be causing this behaviour, if there is any better way to debug or predict it, or if this is just what I should expect from spot instances?
Thank you!
Actually, for the m5.large instance in the eu-west-2 (London) region it's a 5%-10% frequency of interruption, so you can expect up to 10%. I'm not saying that the issue you are facing is because of this.
AWS terminates your spot instances for any of these reasons:
The Spot price is above the maximum price.
There isn't enough capacity.
Amazon EC2 can't meet the constraints you placed on your Spot request.
In your case, since you are seeing the instance-terminated-no-capacity message, it is definitely the second reason. Since you asked for 40 such instances, the Amazon spot instance pool might not have had enough capacity at that time.
The capacity of the available spot instance pool depends on the demand for regular instances; when users ask for regular on-demand instances and there is not enough capacity, AWS will start terminating spot instances to fulfil those requests.
In my experience, spot interruption is highly variable, and for some instance types it is more or less likely at different times of the day.
If you need 40 instances and they do not need to be in the same availability zone (AZ) as each other, you might reduce the chance of a mass interruption of all or most instances by spreading the machines across different AZs within the region, since each availability zone has its own pool of machines. However, you will likely increase the chance that some machines will be interrupted.
Note that this is not an option if you are using EMR; in that case they have to be in the same AZ.
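If you want to script a quick comparison across AZs before launching, a rough boto3 sketch along these lines pulls recent spot price history per Availability Zone (the region and instance type are taken from the question; price alone does not predict interruptions, but it hints at which pools are under pressure):

```python
# Rough sketch: compare recent spot prices per AZ for m5.large in eu-west-2.
# Assumes boto3 is configured with credentials; values are illustrative.
from datetime import datetime, timedelta
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")

history = ec2.describe_spot_price_history(
    InstanceTypes=["m5.large"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(hours=6),
    EndTime=datetime.utcnow(),
)

# Keep only the most recent price seen for each Availability Zone.
latest = {}
for entry in history["SpotPriceHistory"]:
    az = entry["AvailabilityZone"]
    if az not in latest or entry["Timestamp"] > latest[az]["Timestamp"]:
        latest[az] = entry

for az, entry in sorted(latest.items()):
    print(f"{az}: ${entry['SpotPrice']} at {entry['Timestamp']}")
```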

AWS Spot/OnDemand Instance Management

Is there a way to elegantly script/configure a Spot Instance request so that, if Spot capacity is not available within some specified duration, it just uses On-Demand? And if the Spot Instance gets terminated, it just shifts to On-Demand?
Spot Fleet does not do this (it manages only Spot); EMR instance fleets have some logic around this. You can have Auto Scaling with Spot or On-Demand but not both (although you can have two separate ASGs simulate this behavior).
This seems like it should be a baseline use case.
Also, does an event get triggered when a Spot Instance is launched or terminated? I am only seeing CLI commands to check Spot status, not any CloudWatch metric/event.
CloudWatch Events instance state-change notifications fire when any instance changes state.
They fire for any state in the lifecycle of an instance:
pending (launching), running (launch complete), shutting-down, stopped, stopping, and terminated, for any instance (or for all instances, which is probably what you want -- just disregard any instance that isn't of interest), and this includes both on-demand and spot.
http://docs.aws.amazon.com/AmazonCloudWatch/latest/events/EventTypes.html#ec2_event_type
http://docs.aws.amazon.com/AmazonCloudWatch/latest/events/LogEC2InstanceState.html
You could use this to roll your own solution -- there's no built-in mechanism for marshaling mixed fleets.
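As a hedged boto3 sketch of that wiring (the rule name and Lambda ARN are placeholders, and the Lambda would also need a resource-based permission allowing events.amazonaws.com to invoke it), you can create a rule that matches EC2 instance state-change notifications and sends them to a function of yours:

```python
# Sketch: route EC2 instance state-change events to a Lambda handler.
# Rule name and Lambda ARN below are placeholders.
import json
import boto3

events = boto3.client("events")

pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Instance State-change Notification"],
    # Narrow to the states you care about, e.g. terminations:
    "detail": {"state": ["terminated"]},
}

events.put_rule(
    Name="ec2-state-change-to-lambda",
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)

events.put_targets(
    Rule="ec2-state-change-to-lambda",
    Targets=[{
        "Id": "handler",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:on-ec2-state-change",
    }],
)
```

Inside the handler you would still need to decide whether the instance was Spot or On-Demand (for example by describing the instance) before reacting, since the event fires for both.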
I used to do this from the ELB with health checks. You can make two groups, one with Spot Instances and one with Reserved or On-Demand. Create a CloudWatch alarm that fires when the Spot group contains zero healthy hosts, and scale up the other group when it fires. And the other way around: when the Spot group has enough healthy hosts, scale down the other group. Use 30-second health checks on the alarm you use to scale up and a 30-60 minute cooldown on scale down.
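A rough boto3 sketch of that wiring, assuming a Classic ELB (hence the AWS/ELB namespace) and with the group name, load balancer name, and adjustment values as placeholders:

```python
# Sketch: when the spot group's ELB has zero healthy hosts, fire an alarm
# that triggers a scale-up policy on the on-demand ASG.
# Names, thresholds, and adjustments below are placeholders.
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Simple-scaling policy that adds instances to the on-demand group.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName="ondemand-group",
    PolicyName="scale-up-when-spot-unhealthy",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=2,
    Cooldown=300,
)

cloudwatch.put_metric_alarm(
    AlarmName="spot-group-zero-healthy-hosts",
    Namespace="AWS/ELB",                      # Classic ELB metrics
    MetricName="HealthyHostCount",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "spot-elb"}],
    Statistic="Minimum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",   # i.e. zero healthy hosts
    AlarmActions=[policy["PolicyARN"]],
)
```

The reverse alarm (scale the on-demand group back down once the Spot group is healthy again) is wired the same way with the opposite comparison and the longer cooldown mentioned above.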
There is also SpotML, which allows you to always keep a Spot Instance or an On-Demand instance up and running.
In addition to simply spawning the instance, it also allows you to:
Preserve data via persistent storage
Configure a startup script that runs each time a new instance is spawned.
Disclosure: I'm also the creator of SpotML, it's primarily useful for ML/DataScience workflows that can largely just run on spot instances.

Is it possible that I request an EC2 instance but can't get fulfilled?

Edit: This question is NOT about "Spot Instances"; it is about regular "On-Demand Instances". I think I need to clarify this after reading the comments below.
Basically, my question is about whether I should consider the risk that when I need to launch an EC2 instance, but that EC2 region has run out of capacity and can't fulfill my request.
I understand the chance for the above situation is extremely low, but I'd like to understand if AWS has any SLA to make sure that situation/risk won't happen.
There are protective controls in place that make the unavailability of a particular instance type in a particular availability zone at a particular time unlikely... but it is possible, and AWS provides no assurance that a given type of EC2 instance will be available for on-demand launch at any particular time in any particular Availability Zone unless you purchased reserved instances of that type, specifically in that availability zone. In that case, there is supposed to always be sufficient hardware available so that you can have at least the number of paid reserved instances running, including the ability to launch enough new instances to bring the total up to that minimum.
Reserved instances are commonly discussed in the context of their associated discount, but they have two purposes:
Reserved Instances are not physical instances, but rather a billing discount applied to the use of On-Demand Instances in your account. These On-Demand Instances must match certain attributes in order to benefit from the billing discount.
When you purchase Reserved Instances in a specific Availability Zone, they provide a capacity reservation. (emphasis added)
For example, if you purchased 4 reserved t2.2xlarge instances in us-east-2a, the assurance is that you will always be able to launch enough to bring the total running instances of that type in that zone to 4. If you already have 4, there's not an assurance of being able to start more, but there is an assurance that if you stop them, you will be able to start them again.
The pricing model for reserved instances has changed over the years, such that reserved instances are generally billed at the same rate whether they're running or not, so you can look at it one of two ways:
If you need the capacity all the time, you're getting a substantial discount. If you don't need the capacity all the time, you're technically paying all the time for capacity that is largely unneeded, but you are still paying less than you'd pay for on-demand instances without reservations, and you can either leave it running or launch it when you need it.
Should you consider the risk that an entire region has widespread capacity issues? You should consider it, but there are, historically speaking, other significant outage scenarios that are more likely... EBS and S3 have both had failures that impaired the ability to launch instances, even though the capacity was idle in EC2.
Yes. I've had API calls to create EC2 resources fail many times due to lack of available AWS resources. I most commonly see this when attempting to create a new EC2 instance with Dedicated Tenancy in a specific Availability Zone.
Yes. It is possible your instance request cannot be fulfilled. On-Demand does not guarantee you an instance. In particular, t2.small instances are more likely to go unfulfilled, based on my experience. It is possible AWS has only a limited number of t2.small instances.
How can you make sure it is always fulfilled?
Reserve the instances so that the capacity is not given to anyone else. But there is a cost associated with it: you have to pay for the instance irrespective of whether you use it or not. I am talking in general terms; Reserved Instances are a complex topic, but that is the route you should take if you want AWS to guarantee you an instance.
The answer is yes, your launch request can fail because there is no available capacity in the relevant Availability Zone. I would say that it's a rare occurrence, but certainly possible.
You can mitigate this by using multiple Availability Zones in the same region, other regions, or Reserved Instances.
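A hedged boto3 sketch of the multi-AZ mitigation (AMI ID, instance type, and subnet IDs are placeholders): try run_instances in your preferred subnet and fall back to the next AZ when the call fails with InsufficientInstanceCapacity.

```python
# Sketch: fall back to another AZ when on-demand capacity is unavailable.
# AMI ID, instance type, and subnet IDs are placeholders.
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

# One subnet per Availability Zone, in order of preference.
SUBNETS_BY_AZ = ["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"]

def launch_with_fallback():
    for subnet_id in SUBNETS_BY_AZ:
        try:
            resp = ec2.run_instances(
                ImageId="ami-0123456789abcdef0",
                InstanceType="r4.xlarge",
                MinCount=1,
                MaxCount=1,
                SubnetId=subnet_id,
            )
            return resp["Instances"][0]["InstanceId"]
        except ClientError as err:
            if err.response["Error"]["Code"] == "InsufficientInstanceCapacity":
                continue  # try the next Availability Zone
            raise  # any other error is a real problem

    raise RuntimeError("No capacity in any configured Availability Zone")

print(launch_with_fallback())
```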
Today it happened in an AWS account that I manage. If I am not wrong, it was the r4 family, specifically an r4.xlarge instance (4 hours ago), in the Virginia region. I had to choose a different AZ. That is why AWS always recommends working with more than one AZ.
But for that reason, I started to use Scheduled Reserved Instances.
For a defined period of time, you will always have reserved capacity for a given instance family and type.
Sure, it works if you have a defined schedule or workflow.
Hope it helps you.

Configure ECS to use Reserved Instances

ECS (EC2 Container Service) works with Auto Scaling Groups. But is it possible to use ECS with Reserved Instances? I ask this because Reserved Instances are cheaper.
When you reserve an instance, you commit to paying for every hour of the reservation period, whether you use those hours or not (how much of that is billed up front depends on the payment option). At the end of each billing period, AWS reconciles your usage hours against your reserved hours.
This means that regardless of whether you stopped or started the instance automatically, you're still paying for it, albeit at the lower reserved price. Reserved instances do make sense for the 'non-elastic' portion of your ASG, like instance 1 of 3 that always needs to be up. If you have a lot of the same instance type, you can also 'float' reserved instance hours by estimating capacity and making your best guess as to how many auto-scaled hours you will end up needing.
This approach is good if you have an established track record of use, like peak load between 7AM and 5PM PST. You can do the math from there.
My preference for elasticity in ECS is to use the Spot market. You will pay much, much less, and auto scaling works great.
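As a rough sketch of that approach (cluster name, AMI ID, instance type, maximum price, and IAM instance profile are all placeholders; it assumes an ECS-optimized AMI whose agent reads /etc/ecs/ecs.config, and omits networking and security group details), you can request Spot Instances whose user data joins them to the ECS cluster:

```python
# Sketch: add Spot capacity to an ECS cluster by passing user data that
# joins each instance to the cluster. All identifiers are placeholders.
import base64
import boto3

ec2 = boto3.client("ec2")

user_data = "#!/bin/bash\necho ECS_CLUSTER=my-ecs-cluster >> /etc/ecs/ecs.config\n"

ec2.request_spot_instances(
    SpotPrice="0.05",                         # your maximum price
    InstanceCount=3,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",   # ECS-optimized AMI
        "InstanceType": "m5.large",
        "IamInstanceProfile": {"Name": "ecsInstanceRole"},
        # request_spot_instances expects user data to be base64-encoded.
        "UserData": base64.b64encode(user_data.encode()).decode(),
    },
)
```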

AWS AutoScaling 'oldestinstance' Termination Policy does not always terminate oldest instances

Scenario
I am creating a script that will launch new instances into an AutoScaling Group and then remove the old instances. The purpose is to introduce newly created (or updated) AMIs to the AutoScaling Group. This is accomplished by increasing the Desired capacity to double the current number of instances. Then, after the new instances are Running, decreasing the Desired capacity by the same number.
Problem
When I run the script, I watch the group capacity increase by double, the new instances come online, they reach the Running state, and then the group capacity is decreased. Works like a charm. The problem is that SOMETIMES the instances that are terminated by the decrease are actually the new ones instead of the older ones.
Question
How can I ensure that the AutoScaling Group will always terminate the Oldest Instance?
Settings
The AutoScaling Group has the following Termination Policies: OldestInstance, OldestLaunchConfiguration. The Default policy has been removed.
The Default Cooldown is set to 0 seconds.
The Group only has one Availability Zone.
Troubleshooting
I played around with the Cooldown setting. Ended up just putting it on 0.
I waited different lengths of time to see if the existing servers needed to be running for a certain amount of time before they would be terminated. It seems that if they are less than 5 minutes old, they are less likely to be terminated, but not always. I had servers that were 20 minutes old that were left running while newer ones were terminated. Perhaps newly launched instances have some termination protection grace period?
Concession
I know that in most cases, the servers I will be replacing will have been running for a long time. In production, this might not be an issue. Still, it is possible that during the normal course of AutoScaling, an older server will be left running instead of a newer one. This is not an acceptable way to operate.
I could force specific instances to terminate, but that would defeat the point of the OldestInstance Termination Policy.
Update: 12 Feb 2014
I have continued to see this in production. Instances with older launch configs that have been running for weeks will be left running while newer instances will be terminated. At this point I am considering this to be a bug. A thread at Amazon was opened for this topic a couple years ago, apparently without resolution.
Update: 21 Feb 2014
I have been working with AWS support staff and at this point they have preliminarily confirmed it could be a bug. They are researching the problem.
It doesn't look like you can, precisely, because auto-scaling is trying to do one other thing for you in addition to having the correct number of instances running: keep your instance counts balanced across availability zones... and it prioritizes this consideration higher than your termination policy.
Before Auto Scaling selects an instance to terminate, it first identifies the Availability Zone that has more instances than the other Availability Zones used by the group. If all Availability Zones have the same number of instances, it identifies a random Availability Zone. Within the identified Availability Zone, Auto Scaling uses the termination policy to select the instance for termination.
— http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/us-termination-policy.html
If you're out of balance, then staying in balance is arguably the most sensible strategy, especially if you are using ELB. The documentation is a little ambiguous, but ELB will advertise one public IP in the DNS for each availability zone where it is configured; these IP addresses achieve the first tier of load balancing by virtue of round-robin DNS. If all of the availability zones where the ELB is enabled have healthy instances, then there appears to be a 1:1 correlation between which external IP the traffic hits and which availability zone's servers that traffic will be offered to by ELB -- at least that is what my server logs show. It appears that ELB doesn't route traffic across availability zones to alternate servers unless all of the servers in a given zone are detected as unhealthy, and that may be one of the justifications for why they've implemented autoscaling this way.
Although this algorithm might not always kill the oldest instance first on a region-wide basis, if it does operate as documented, it will kill off the oldest one in the selected availability zone, and at some point it should end up cycling through all of them over the course of several shifts in load... so it would not leave the oldest running indefinitely, either. The larger the number of instances in the group, the less significant this effect should be.
There are a couple of other ways to do it:
Increase desired to 2x
Wait for action to increase capacity
When the new instances are running, suspend all AS activity (as-suspend-processes MyAutoScalingGroup)
Reset desired
Terminate old instances
Resume AS activity (see the sketch just below).
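A hedged boto3 sketch of those steps (the group name is a placeholder, MaxSize is assumed to already allow double the desired capacity, and the waits/health checks between steps are omitted for brevity):

```python
# Sketch of the steps above with boto3. Group name is a placeholder;
# waiting for new instances to become InService is omitted.
import boto3

ASG_NAME = "MyAutoScalingGroup"

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")

group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[ASG_NAME]
)["AutoScalingGroups"][0]
old_instance_ids = [i["InstanceId"] for i in group["Instances"]]
original_desired = group["DesiredCapacity"]

# 1. Increase desired capacity to 2x.
autoscaling.set_desired_capacity(
    AutoScalingGroupName=ASG_NAME, DesiredCapacity=original_desired * 2
)

# 2. (Wait here until the new instances are InService -- omitted.)

# 3. Suspend all Auto Scaling processes so the group doesn't react
#    to the changes below.
autoscaling.suspend_processes(AutoScalingGroupName=ASG_NAME)

# 4. Reset desired capacity to the original value.
autoscaling.set_desired_capacity(
    AutoScalingGroupName=ASG_NAME, DesiredCapacity=original_desired
)

# 5. Terminate the old instances explicitly.
ec2.terminate_instances(InstanceIds=old_instance_ids)

# 6. Resume Auto Scaling activity.
autoscaling.resume_processes(AutoScalingGroupName=ASG_NAME)
```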
Or:
Bring up a brand new ASG with the new launch config.
Suspend AS activity , until 1. is finished.
If everything is ok, delete the old ASG.
Resume AS activity
For ultimate rollback deployment:
Create a new ELB (you might have to ask Amazon to pre-warm the ELB if you have a lot of traffic; this is kind of lame and makes it not automation friendly)
Create new ASG with new LC
Switch DNS to new ELB
Delete old ELB/ASG/LC if everything's fine, if not just change DNS back
Or with the new ASG API that lets you attach/detach instances from ASG:
Somehow bring up your new instances (this could just be run-instances, or create a temporary ASG)
Suspend AS activity, then use http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/attach-instance-asg.html to attach them to your old ASG
Use http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/detach-instance-asg.html to detach your old instances, or simply terminate them
Resume AS activity
The reason you might want to use your old ASG is because it can be a pita to set all the policies again (even when automated) and it feels a bit safer to change as little as possible.
A.
My use case is that we needed to scale down and be able to choose which machines go down. Unfortunately, the OldestInstance termination policy was not working for us either. I was able to use a variant of the attach/detach method that ambakshi shared to remove the oldest (or any instance I choose) and at the same time lower the desired instances value of the autoscaling group.
Step 1 – Change the autoscaling group Min value to the number you want to scale down to.
Step 2 – Suspend the ASG
Step 3 – Detach the instances you want to terminate; you can do multiple instances in one command. Make sure to use the should-decrement-desired-capacity flag
Step 4 – Resume the ASG
Step 5 – Terminate your instances using the console or the CLI
UPDATE
There is no need to suspend the Auto Scaling Group; just doing steps 1, 3, and 5 worked for me (sketched below). Just be aware of any Availability Zone rebalancing that may happen.
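A hedged boto3 sketch of steps 1, 3, and 5 (the group name, target minimum, and instance IDs are placeholders):

```python
# Sketch of steps 1, 3, and 5 with boto3. Group name, minimum size,
# and instance IDs below are placeholders.
import boto3

ASG_NAME = "my-asg"
INSTANCES_TO_REMOVE = ["i-0123456789abcdef0", "i-0fedcba9876543210"]

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")

# Step 1: lower the group's minimum size to the target you are scaling down to.
autoscaling.update_auto_scaling_group(AutoScalingGroupName=ASG_NAME, MinSize=2)

# Step 3: detach the chosen instances and decrement the desired capacity
# so the group does not launch replacements.
autoscaling.detach_instances(
    AutoScalingGroupName=ASG_NAME,
    InstanceIds=INSTANCES_TO_REMOVE,
    ShouldDecrementDesiredCapacity=True,
)

# Step 5: terminate the detached instances yourself.
ec2.terminate_instances(InstanceIds=INSTANCES_TO_REMOVE)
```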