AWS spot instances are evicted much sooner and more often than expected

I am trying to use AWS spot instances (m5.large in the eu-west-2 region) with a maximum bid equal to the price of on-demand instances. According to https://aws.amazon.com/ec2/spot/instance-advisor/ these instances should have a < 5% frequency of interruption; however, after launching 40 such instances this morning, I have found that within the hour 34 of them were evicted by AWS ("instance-terminated-no-capacity" according to the Spot Requests page on the EC2 dashboard).
This eviction rate looks much too high compared to both Amazon's own advisor and other users' experiences. Does anybody know what could be causing this behaviour, whether there is any better way to debug or predict it, or whether this is just what I should expect from spot instances?
Thank you!

Actually, for the m5.large instance in the eu-west-2 (London) region it's a 5%-10% frequency of interruption, so you can expect up to 10%. I'm not saying that the issue you are facing is because of this.
AWS terminates your spot instances for any of these reasons:
The Spot price is above the maximum price.
There isn't enough capacity.
Amazon EC2 can't meet the constraints you placed on your Spot request.
In your case, since you are seeing the instance-terminated-no-capacity message, it is definitely because of the second reason. Since you asked for 40 such instances, the Amazon spot instance pool might not have had enough capacity at that time.
The capacity of the available spot instance pool depends on the demand for regular instances: when users ask for regular on-demand instances and there is not enough capacity, AWS will start terminating spot instances to fulfil those requests.
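If you want to dig further, it may help to pull the status code of each spot request directly from the API rather than the console; below is a minimal boto3 sketch, assuming your credentials are configured and using eu-west-2 to match the question:

```python
import boto3

# Assumes credentials are configured; eu-west-2 matches the question.
ec2 = boto3.client("ec2", region_name="eu-west-2")

# List all spot requests in the region and print why each one is in its current state.
response = ec2.describe_spot_instance_requests()
for req in response["SpotInstanceRequests"]:
    status = req.get("Status", {})
    print(
        req["SpotInstanceRequestId"],
        req["State"],
        status.get("Code"),      # e.g. "instance-terminated-no-capacity"
        status.get("Message"),
    )
```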

In my experience spot interruption is highly variable, and for some instance types it is more or less likely at different times of the day.
If you need 40 instances and they do not all need to be in the same availability zone (AZ), you may reduce the chance of a mass interruption of all or most of them by spreading the machines across different AZs within the region, since each availability zone has its own pool of machines, although you will likely increase the chance that at least some machines will be interrupted.
Note this is not an option if you are using EMR, as the instances then have to be in the same AZ.
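As a rough illustration of spreading the request across AZs (not an EMR setup, per the note above), you could issue one spot launch per subnet/AZ; the AMI and subnet IDs below are placeholders:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")

# Placeholder subnets, one per AZ -- replace with subnets from your own VPC.
subnets_by_az = {
    "eu-west-2a": "subnet-aaaaaaaa",
    "eu-west-2b": "subnet-bbbbbbbb",
    "eu-west-2c": "subnet-cccccccc",
}

per_az = 40 // len(subnets_by_az)  # spread the 40 machines roughly evenly

for az, subnet_id in subnets_by_az.items():
    # One spot request per AZ, so a capacity problem in one pool
    # does not take out the whole fleet at once.
    ec2.run_instances(
        ImageId="ami-xxxxxxxx",    # placeholder AMI
        InstanceType="m5.large",
        MinCount=1,
        MaxCount=per_az,
        SubnetId=subnet_id,
        InstanceMarketOptions={
            "MarketType": "spot",
            "SpotOptions": {"SpotInstanceType": "one-time"},
        },
    )
```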

Related

I want AWS Spot pricing for a long-running job. Is a spot request of one instance the best way to achieve this?

I have a multi-day analysis problem that I am running on a 72-CPU c5n EC2 instance. To get spot pricing, I made my code interruption-resilient and am launching a spot request for one instance. It works great, but this seems like overkill given that Spot can handle thousands of instances. Is this the correct way to solve my problem, or am I using a sledgehammer to squash a fly?
I've tried normal EC2 launching, which works great, except that it is four times the price. I don't know of any other way to approach this except for these two. I thought about Fargate or containers or something, but I am running a 72-CPU c5n node, and those other options won't let me use that kind of horsepower (that I know of, hence my question).
Thanks!
Amazon EC2 Spot Instances are an excellent way to get cheaper compute (up to 90% discount). The only downside is that the instances might be stopped/terminated (your choice) if there is insufficient capacity.
Some strategies to improve your chance of obtaining spot instances:
Use instances across different Instance Types and Availability Zones because they each have different availability pools (EC2 Spot Fleet can assist with this)
Use resources on weekends and in evenings (even in different regions!) because these tend to be times of lower usage
Use Spot Instances with a specified duration (also known as Spot blocks); these come at a higher price and have a maximum duration of 6 hours
If your software permits it, you could split your load between multiple instances to get the job done faster and to be more resilient against any stoppages of your Spot instances.
Hopefully your application is taking advantage of all the CPUs, otherwise you'd be better-off with smaller instances.
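Since you mention the code is already interruption-resilient, one common pattern is to poll the instance metadata for the spot interruption notice and checkpoint as soon as it appears. A rough sketch, assuming IMDSv1 is enabled (IMDSv2 would additionally need a session token) and with a hypothetical checkpoint() standing in for your own save logic:

```python
import time
import urllib.error
import urllib.request

# The interruption notice appears here roughly two minutes before reclamation.
NOTICE_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending():
    """Return True once AWS has scheduled this spot instance for interruption."""
    try:
        urllib.request.urlopen(NOTICE_URL, timeout=2)
        return True                  # 200 means a stop/terminate action is scheduled
    except urllib.error.HTTPError:
        return False                 # 404 means no interruption notice yet
    except urllib.error.URLError:
        return False                 # metadata service unreachable (not on EC2?)

def checkpoint():
    """Hypothetical: persist analysis progress (e.g. to S3 or EBS) before shutdown."""
    pass

while True:
    if interruption_pending():
        checkpoint()
        break
    time.sleep(5)
```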

AWS EC2 - Clarification on number of instances

This is from the Amazon EC2 FAQ:
Q: How quickly can I scale my capacity both up and down?
Amazon EC2 provides a truly elastic computing environment. Amazon EC2 enables you to increase or decrease capacity within minutes, not hours or days. You can commission one, hundreds or even thousands of server instances simultaneously. When you need more instances, you simply call RunInstances, and Amazon EC2 will typically set up your new instances in a matter of minutes. Of course, because this is all controlled with web service APIs, your application can automatically scale itself up and down depending on its needs.
Now again, as per the same FAQ, I am only allowed to launch 20 instances per region. They said I have to fill in a request form if I need more than 20 instances. So, in effect, I can't spin up more than 20 programmatically?
What am I missing here? How can we launch 100 instances, let alone thousands? Sorry if this is the wrong place for such a question.
You cannot launch instances beyond the instance limit. You need to make a request to increase the instance limit. This is a safety feature so that:
A wild loop in your SDK/API script does not launch instances continuously
A malicious user does not launch a large number of instances
A hacker gets access to your account and launches a large number of instances
An incorrectly configured autoscaling group launches a huge number of instances
If you require more than your instance limit, you need to submit a request to AWS. See: Amazon EC2 Service Limits. AWS will review your request and approve it.
You are missing the fact that limit increase requests are very easy to make and are almost always granted with no questions asked within a day or two.
To request a limit increase:
Open the AWS Support Center page, sign in if necessary, and choose Create Case.
For Regarding, choose Service Limit Increase.
Complete Limit Type, Use Case Description, and Contact method. If this request is urgent, choose Phone as the method of contact instead of Web.
Choose Submit.
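If you prefer to raise the limit programmatically, newer accounts can also go through the Service Quotas API. A hedged sketch; the quota code below (assumed to be the running On-Demand Standard instances quota) and the desired value are illustrative and should be verified with list_service_quotas first:

```python
import boto3

# Service Quotas is regional; use the region whose EC2 limit you want raised.
quotas = boto3.client("service-quotas", region_name="eu-west-2")

# Quota code assumed here; confirm it via quotas.list_service_quotas(ServiceCode="ec2").
response = quotas.request_service_quota_increase(
    ServiceCode="ec2",
    QuotaCode="L-1216C47A",   # assumed: "Running On-Demand Standard instances" (vCPU-based)
    DesiredValue=256,         # illustrative target value
)
print(response["RequestedQuota"]["Status"])
```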
The AWS FAQs provide a clear answer:
You are limited to running up to a total of 20 On-Demand instances across the instance family, purchasing 20 Reserved Instances, and requesting Spot Instances per your dynamic Spot limit per region. New AWS accounts may start with limits that are lower than the limits described here. Certain instance types are further limited per region as follows
For Spot Instance limits, AWS states:
The usual Amazon EC2 limits apply to instances launched by a Spot Fleet, such as Spot request price limits, instance limits, and volume limits. In addition, the following limits apply:
The number of active Spot Fleets per region: 1,000
The number of launch specifications per fleet: 50
The size of the user data in a launch specification: 16 KB
The target capacity per Spot Fleet: 3,000
The target capacity across all Spot Fleets in a region: 5,000
A Spot Fleet request can't span regions.
A Spot Fleet request can't span different subnets from the same Availability Zone.
These limits protect you from a hacker attack, stolen API keys, etc. If you want to increase these limits, you need to send a request to the AWS support team: AWS Support Center
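For reference, those per-fleet limits apply to a request shaped roughly like the sketch below (the fleet role ARN, AMI, and subnet IDs are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")

response = ec2.request_spot_fleet(
    SpotFleetRequestConfig={
        # Placeholder role; the fleet needs an IAM role it can assume.
        "IamFleetRole": "arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-role",
        "TargetCapacity": 40,            # must stay within the 3,000 per-fleet limit
        "AllocationStrategy": "lowestPrice",
        "LaunchSpecifications": [        # up to 50 of these per fleet
            {"ImageId": "ami-xxxxxxxx", "InstanceType": "m5.large", "SubnetId": "subnet-aaaaaaaa"},
            {"ImageId": "ami-xxxxxxxx", "InstanceType": "m5.xlarge", "SubnetId": "subnet-bbbbbbbb"},
        ],
    }
)
print(response["SpotFleetRequestId"])
```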

Reduce price on AWS (EC2 and spot instances)

I have a queue of jobs and running AWS EC2 instances which process the jobs. We have an Auto Scaling group for each c4.* instance type, in both spot and on-demand versions.
Each instance has a power, which is a number equal to the instance's number of CPUs (for example, c4.large has power=2 since it has 2 CPUs).
The exact power we need is simply calculated from the number of jobs in the queue.
I would like to implement an algorithm which would periodically check the number of jobs in the queue and change the desired value of the particular AutoScaling groups by AWS SDK to save as much money as possible and maintain the total power of instances to keep jobs processed.
Especially:
I prefer spot instances to on-demand since they are cheaper
EC2 instances are charged per hour, so we would like to turn off an instance only at the very last minute of its one-hour uptime.
We would like to replace on-demand instances with spot instances when possible. So, at 55 minutes increase the spot group; at 58 minutes check that the new spot instance is running and, if yes, decrease the on-demand group.
We would like to replace spot instances with on-demand instances if the bid would be too high: just turn off the spot one and turn on an on-demand one.
Seems the problem is really difficult to handle. Anybody have any experience or a similar solution implemented?
You could certainly write your own code to do this, effectively telling your Auto Scaling groups when to add/remove instances.
Also, please note that a good strategy for lowering costs with Spot Instances is to appreciate that the price for a spot instance varies by:
Region
Availability Zone
Instance Type
So, if the spot price for a c4.xlarge goes up in one AZ, it might still be the same cost in another AZ. Also, the price of a c4.2xlarge might then be lower than a c4.xlarge, with twice the power.
Therefore, you should aim to diversify your spot instances across multiple AZs and multiple instance types. This means that spot price changes will impact only a small portion of your fleet rather than all of it at once.
You could use Spot Fleet to assist with this, or even third-party products such as SpotInst.
It's also worth looking at AWS Batch (not currently available in every region), which is designed to intelligently provide capacity for batch jobs.
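To see the per-AZ and per-type price variation described above for yourself, you can pull recent spot price history with boto3; a minimal sketch (the region and instance types are just examples):

```python
from datetime import datetime, timedelta

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Compare the last hour of spot prices for two sizes across all AZs in the region.
history = ec2.describe_spot_price_history(
    InstanceTypes=["c4.xlarge", "c4.2xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(hours=1),
)

for item in history["SpotPriceHistory"]:
    print(item["AvailabilityZone"], item["InstanceType"], item["SpotPrice"])
```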
Auto Scaling groups allow you to use alarms and metrics that are defined outside of the Auto Scaling group.
If you are using SQS for the job queue, you should be able to set up an alarm on the queue depth and use that to scale your scaling group up and down.
If you are using a custom queue system, you can push metrics to CloudWatch to create a similar alarm.
You can control how often scaling actions occur, but it may be difficult to get the run time to exactly one hour.
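A minimal sketch of that metric-driven approach, pushing a custom queue-depth metric to CloudWatch and nudging a group's desired capacity directly (the namespace, metric name, group name, and the jobs-to-power mapping are made up for illustration):

```python
import boto3

REGION = "eu-west-1"                                  # illustrative region
queue_depth = 120                                     # illustrative: read this from your queue

# Publish the queue depth so a CloudWatch alarm can drive the scaling policy.
cloudwatch = boto3.client("cloudwatch", region_name=REGION)
cloudwatch.put_metric_data(
    Namespace="MyJobQueue",                           # made-up namespace
    MetricData=[{"MetricName": "JobsWaiting", "Value": queue_depth, "Unit": "Count"}],
)

# Or adjust the group directly from the same periodic script.
autoscaling = boto3.client("autoscaling", region_name=REGION)
autoscaling.set_desired_capacity(
    AutoScalingGroupName="spot-c4-large-group",       # placeholder group name
    DesiredCapacity=max(1, queue_depth // 50),        # crude jobs-to-power mapping
)
```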

Is it possible that I request an EC2 instance but can't get fulfilled?

Edit: This question is NOT about "Spot Instances"; it is about regular "On-Demand Instances". I think I need to clarify this after reading the comments below.
Basically, my question is about whether I should consider the risk that when I need to launch an EC2 instance, but that EC2 region has run out of capacity and can't fulfill my request.
I understand the chance for the above situation is extremely low, but I'd like to understand if AWS has any SLA to make sure that situation/risk won't happen.
There are protective controls in place that make the unavailability of a particular instance type, in a particular Availability Zone, at a particular time unlikely... but it is possible, and AWS provides no assurance that a given type of EC2 instance will be available for launch, on demand, at any particular time in any particular Availability Zone, unless you have purchased Reserved Instances of that type specifically in that Availability Zone. In that case, there is supposed to always be sufficient hardware available so that you can have at least the number of paid reserved instances running, including the ability to launch enough new instances to bring the total up to that minimum.
Reserved instances are commonly discussed in the context of their associated discount, but they have two purposes:
Reserved Instances are not physical instances, but rather a billing discount applied to the use of On-Demand Instances in your account. These On-Demand Instances must match certain attributes in order to benefit from the billing discount.
When you purchase Reserved Instances in a specific Availability Zone, they provide a capacity reservation. (emphasis added)
For example, if you purchased 4 reserved t2.2xlarge instances in us-east-2a, the assurance is that you will always be able to launch enough to bring the total running instances of that type in that zone to 4. If you already have 4, there's not an assurance of being able to start more, but there is an assurance that if you stop them, you will be able to start them again.
Pricing models for reserved instances have changed over the years, such that reserved instances are generally billed at the same rate whether they're running or not, so you can look at it one of two ways:
If you need the capacity all the time, you're getting a substantial discount... or, if you don't need the capacity all the time, you're technically paying all the time for capacity that is largely unneeded, but you are still paying less than you'd pay for on-demand instances without reservations, and you can either leave it running or launch it when you need it.
Should you consider the risk that an entire region has widespread capacity issues? You should consider it, but there are, historically speaking, other significant outage scenarios that are more likely... EBS and S3 have both had failures that impaired the ability to launch instances, even though the capacity was idle in EC2.
Yes. I've had API calls to create EC2 resources fail many times due to lack of available AWS resources. I most commonly see this when attempting to create a new EC2 instance with Dedicated Tenancy in a specific Availability Zone.
Yes. It is possible your instance request cannot be fulfilled; an On-Demand request does not guarantee you an instance. In particular, t2.small instances are more likely to go unfulfilled, based on my experience. It is possible AWS has only a limited number of t2.small instances.
How can you make sure it is always fulfilled?
Reserve the instances so they are not given to anyone else. But there is a cost associated with it: you have to pay for the instance irrespective of whether you use it or not. I am talking in general terms. Reserved Instances are a complex topic, but that is the route you should take if you want AWS to guarantee you an instance.
The answer is yes, your launch request can fail because there is no available capacity in the relevant Availability Zone. I would say that it's a rare occurrence, but certainly possible.
You can mitigate by using multiple zones in the same region, other regions, or by using Reserved Instances.
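One way to implement that mitigation in code is to catch the capacity error and retry in another Availability Zone; a rough sketch (the AMI and subnet IDs are placeholders):

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder subnets, each in a different AZ, tried in order.
candidate_subnets = ["subnet-aaaaaaaa", "subnet-bbbbbbbb", "subnet-cccccccc"]

for subnet_id in candidate_subnets:
    try:
        ec2.run_instances(
            ImageId="ami-xxxxxxxx",    # placeholder AMI
            InstanceType="r4.xlarge",
            MinCount=1,
            MaxCount=1,
            SubnetId=subnet_id,
        )
        break                          # launched successfully, stop trying
    except ClientError as err:
        if err.response["Error"]["Code"] == "InsufficientInstanceCapacity":
            continue                   # this AZ is out of capacity, try the next one
        raise
```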
Today it happened in an AWS account that I manage. If I am not wrong, it was the r4 family, specifically an r4.xlarge instance (4 hours ago) in the Virginia region. I had to choose a different AZ. That is why AWS always recommends working with more than one AZ.
For that reason, I started to use Scheduled Reserved Instances.
You will always have, for a period of time, reserved capacity for a given family of instance(s).
Sure, it works if you have a defined schedule or workflow.
Hope it helps you.

Does ELB connection draining apply when spot instances are terminated?

A new AWS ELB feature, connection draining, was recently announced.
http://aws.amazon.com/about-aws/whats-new/2014/03/20/elastic-load-balancing-supports-connection-draining/
Apparently this works with Auto Scaling Groups - instances are drained before being removed, but does that also apply to spot instances that are being terminated by AWS due to a rising spot price?
Nothing definitive that I could find, but from my reading on this, I think the answer is almost definitely no. Spot instances are different animals than regular instances, and the way connection draining works, you can specify up to a 60-minute delay before your connection-draining-enabled instance gets terminated when it becomes unhealthy. If AWS were to allow this added layer of safety to spot instances, it would completely up-end the way spot instances are used and how they are positioned.
The trade-off for using spot instances has always been, "you can pay a fraction of the cost, but you risk being terminated at any instant without warning"... if they added an up-to-60-minute 'warning' to spot instances, while it would be fantastic from the end user's point of view, I think it would severely eat into AWS's on-demand and reserved instance pricing model, and thus they probably won't support this anytime soon (unless forced to by competitive pressure).
EDIT 1/6/2015: now, almost a year later, AWS has indeed added a 'two minute warning' for EC2 spot instance termination. https://aws.amazon.com/blogs/aws/new-ec2-spot-instance-termination-notices/
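With that two-minute notice, an instance can effectively drain itself by deregistering from a Classic ELB before AWS reclaims it. A hedged sketch, assuming IMDSv1 is reachable and using a placeholder load balancer name:

```python
import urllib.error
import urllib.request

import boto3

METADATA = "http://169.254.169.254/latest/meta-data"
elb = boto3.client("elb", region_name="us-east-1")

try:
    # A 200 response here means the two-minute termination notice has been posted.
    urllib.request.urlopen(f"{METADATA}/spot/termination-time", timeout=2)
    instance_id = (
        urllib.request.urlopen(f"{METADATA}/instance-id", timeout=2).read().decode()
    )
    # Drain ourselves: pull this instance out of the Classic ELB while it finishes requests.
    elb.deregister_instances_from_load_balancer(
        LoadBalancerName="my-load-balancer",           # placeholder name
        Instances=[{"InstanceId": instance_id}],
    )
except (urllib.error.HTTPError, urllib.error.URLError):
    pass  # no termination notice yet (404) or metadata unreachable
```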