How to autoscale EMR task instances - amazon-web-services

I am using EMR with task instance groups as spot instances. I want to maintain a minimum number of task instances at all times.
That means whenever EMR terminates task instances because the spot price rises above the bid we set, my application should launch another task instance with a slightly higher bid price.
My research:
Use CloudWatch to notify when a threshold is breached, and auto-scale task instances. But from what I have read, there is no concept of auto-scaling in EMR.
Use CloudWatch to notify SQS when a threshold is breached, and have a service that is always consuming from the queue and expanding the task instances.
Questions
Is there any auto-scaling available in EMR? If so, my work reduces to just setting a threshold and the corresponding action to expand the task instances.
If you have any other approach to solve this problem, please suggest.

How Spot Prices Work
When an Amazon EC2 instance is launched with a spot price (including when launched from Amazon EMR), the instance will start if the current spot price is below the provided bid price. If the spot price rises above the bid price, the instance is terminated. Instances are only charged the current spot price.
Therefore, the logic of launching a new spot instance with a "little higher bid price" is not necessary. The instance will always be charged the current spot price, so simply bid as high as you are willing to pay for a spot instance. You will either pay less than your bid price (great!) or your instance will be terminated because the price has gone higher than you are willing to pay (in which case you don't want to pay a "little higher" for the instance).
If you wish to "maintain minimum number of task instances" at all times, then either pay the normal EMR charge (which means the instances won't be terminated) or bid a particularly large price for the spot instances, such as 2 x the normal price. Yes, you might occasionally pay more for instances, but on average your price will be quite low.
If you wish to be particularly sneaky, you could bid up to the normal price for the EC2 instances then, if instances are terminated, launch more task nodes without using spot pricing. That way, your instances won't be terminated and you won't pay more than the normal EC2 price. However, you would have to terminate and replace those instances when the spot price drops, otherwise you are paying too much. That's why it might be better just to provide a high bid price on your spot instances.
Bottom line: Use spot pricing, but bid a high price. You'll get a good price most of the time.

AWS EMR does not have an autoscaling option available, but you can use a workaround and integrate autoscaling using AWS SQS. This is a rough picture of what you can integrate:
Launch your EMR cluster using spot instances.
Set up an SQS queue and create three triggers: one for the CPU threshold, a second for the EC2 spot instance termination notice, and a third for changing the spot instance bid prices.
So if CPU usage increases, SQS will trigger an event to launch a new instance into the cluster; if there is a spot instance termination notice, SQS will trigger the launch of another instance to balance the load and send an event to change the bid price so another spot instance can be launched. (This is just a rough sketch, but I guess you will understand the logic.)
This is the guide to autoscaling based on AWS SQS:
https://docs.aws.amazon.com/autoscaling/latest/userguide/as-using-sqs-queue.html
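If it helps, here is a minimal sketch in Python (boto3) of the consumer side of that idea: a long-running worker that polls the queue and grows the EMR task instance group. The queue URL, cluster ID, and instance group ID are placeholders, and a real handler would inspect each message to distinguish a CPU alarm from a spot termination notice.

```python
import boto3

emr = boto3.client('emr')
sqs = boto3.client('sqs')

QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/emr-scaling'  # placeholder
CLUSTER_ID = 'j-XXXXXXXXXXXX'       # placeholder EMR cluster ID
TASK_GROUP_ID = 'ig-XXXXXXXXXXXX'   # placeholder task instance group ID


def current_task_count():
    """Read the current size of the task instance group."""
    groups = emr.list_instance_groups(ClusterId=CLUSTER_ID)['InstanceGroups']
    return next(g for g in groups if g['Id'] == TASK_GROUP_ID)['RunningInstanceCount']


def resize_task_group(target_count):
    """Ask EMR to grow or shrink the task instance group."""
    emr.modify_instance_groups(InstanceGroups=[
        {'InstanceGroupId': TASK_GROUP_ID, 'InstanceCount': target_count}])


def consume_scaling_events():
    """Poll the queue; treat every message as 'add one task instance'."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL,
                                   WaitTimeSeconds=20, MaxNumberOfMessages=1)
        for msg in resp.get('Messages', []):
            resize_task_group(current_task_count() + 1)
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg['ReceiptHandle'])
```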

As has been correctly pointed out, the EMR API provides all the necessary ingredients to 1) collect monitoring data, and 2) programmatically scale the cluster up and down.
Basically, there are two main options to implement autoscaling for EMR clusters:
Autoscaling Loop: A process that runs on a server and continuously monitors the cluster for its current load. Performance metrics (memory, CPU, I/O, etc.) can be collected at regular intervals and stored in a database. Autoscaling rules are evaluated against the performance metrics, and the cluster's task nodes are scaled up or down if required.
Event-Based Autoscaling: Using CloudWatch metrics (e.g., metrics for EMR or EC2), you can programmatically define triggers that are fired under certain conditions (for instance, add nodes if average CPUUtilization of all nodes exceeds 80%).
Both options have their pros and cons. The main advantage of option 2 is that it is a serverless approach (it does not require you to run your own server). Option 1, on the other hand, does require a server, but in return gives you more control to customize the logic of your scaling rules. It also allows you to keep searchable records of the history of scaling decisions. A minimal sketch of such a loop is shown below.
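For illustration, here is a minimal sketch of such a loop with boto3, assuming a placeholder cluster ID and task instance group ID and using EMR's YARNMemoryAvailablePercentage CloudWatch metric as the load signal; the thresholds and step size are arbitrary.

```python
import time
from datetime import datetime, timedelta

import boto3

emr = boto3.client('emr')
cloudwatch = boto3.client('cloudwatch')

CLUSTER_ID = 'j-XXXXXXXXXXXX'       # placeholder
TASK_GROUP_ID = 'ig-XXXXXXXXXXXX'   # placeholder


def avg_yarn_memory_available():
    """Average free YARN memory (percent) over the last 10 minutes."""
    stats = cloudwatch.get_metric_statistics(
        Namespace='AWS/ElasticMapReduce',
        MetricName='YARNMemoryAvailablePercentage',
        Dimensions=[{'Name': 'JobFlowId', 'Value': CLUSTER_ID}],
        StartTime=datetime.utcnow() - timedelta(minutes=10),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=['Average'])
    points = stats['Datapoints']
    return sum(p['Average'] for p in points) / len(points) if points else None


def autoscaling_loop(min_nodes=2, max_nodes=20, interval=300):
    while True:
        free = avg_yarn_memory_available()
        group = next(g for g in emr.list_instance_groups(
            ClusterId=CLUSTER_ID)['InstanceGroups'] if g['Id'] == TASK_GROUP_ID)
        count = group['RunningInstanceCount']
        target = count
        if free is not None and free < 15 and count < max_nodes:
            target = count + 1   # scale out: little memory headroom left
        elif free is not None and free > 75 and count > min_nodes:
            target = count - 1   # scale in: plenty of headroom
        if target != count:
            emr.modify_instance_groups(InstanceGroups=[
                {'InstanceGroupId': TASK_GROUP_ID, 'InstanceCount': target}])
        time.sleep(interval)
```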
You could take a look at Themis, an EMR autoscaling framework developed at Atlassian. Themis implements the autoscaling loop discussed in option 1 above. Current features include proactive as well as reactive autoscaling and support for spot/on-demand task nodes; it comes with a Web UI, and the tool is very easy to configure.

I have had a similar problem, and I wanted to share one possible alternative. I have written a Java tool to dynamically resize an EMR cluster during the processing. It might help you. Check it out at:
http://www.lopakalogic.com/articles/hadoop-articles/dynamically-resize-emr/
The source code is available on GitHub.

Related

CloudWatch period time

CPU metrics cannot be selected with a period below 1 minute in the CloudWatch service. How can I lower this period to trigger Auto Scaling faster? I need to trigger the Auto Scaling of instances in a short time. (By the way, the datapoints-to-alarm setting is 1 of 1.)
The minimum granularity for the metrics that EC2 provides is 1 minute.
Source: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/viewing_metrics_with_cloudwatch.html
I would also say that if you need to scale that quickly, wouldn't the instance startup time be an issue anyway?
You are correct -- basic monitoring of an Amazon EC2 instance provides metrics over 5-minute periods. If you activate EC2 Detailed Monitoring, metrics are provided over 1-minute periods. Extra charges apply for Detailed Monitoring.
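For reference, switching an instance to 1-minute Detailed Monitoring is a single boto3 call; the instance ID below is a placeholder, and the extra CloudWatch charges mentioned above still apply.

```python
import boto3

ec2 = boto3.client('ec2')

# Enable 1-minute Detailed Monitoring for the given instance.
ec2.monitor_instances(InstanceIds=['i-0123456789abcdef0'])  # placeholder instance ID
```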
When launching a new instance via Amazon EC2 Auto-Scaling, it can take a few minutes for the new instance to launch and for the User Data script (if any) to run. Linux instances are quite fast, but Windows instances take a while on their first boot due to sysprep operations.
You mention that you want to react to a metric in less than one minute. I would suggest that this would not be an ideal way to trigger Auto-scaling. Sometimes a computer can be busy for a while, then can drop down again. Reacting too quickly to a high CPU load would cause the Auto-Scaling group to flap between adding instances and terminating instances. It is better to provision enough capacity for a reasonable amount of extra load and then gradually add more capacity as it is required over time.
If you have a need to react so quickly, then perhaps you should investigate using AWS Lambda to perform small amounts of work in a highly-parallel fashion rather than relying on Amazon EC2 instances.

How to use existing On-Demand instance in spot fleet?

I'm trying to reduce my expenses and want to start using AWS's spot pricing service. I'm completely new to it, but as I understand it, I can have instances that run for certain amounts of time based on the price and that will eventually stop running under certain conditions. That's fine. I'm also aware you can have spot fleets, and in these fleets you can have an On-Demand instance for when the spot instance is interrupted.
I currently have an On-Demand instance that hosts an Elastic Beanstalk application (it's an API). Is there a way to use this instance inside the spot fleet so that when a spot instance is available it services my Elastic Beanstalk application, and when the spot instance is interrupted it goes back to using my current On-Demand instance until another spot instance is available?
Sadly, spot fleets don't work like this. If your spot instance gets terminated, no on-demand replacement is going to be created for you automatically. If it worked like this, everyone would be using spot instances in my view.
The on-demand portion of your spot fleet is separate from the spot portion. This way your application will always run at minimum capacity (without spot). When spot is available, your spot instances will run alongside your on-demand instances. This way you will have more computational power for cheap, which is very beneficial for any heavy processing application (e.g., batch image processing).
Details of how spot fleet and spot instances work are in How Spot Fleet works and How Spot Instances work docs.
Nevertheless, if you would like to have such a replacement provisioned, you would have to develop a custom solution for it.
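For reference, a hedged sketch of what such a Spot Fleet request with an on-demand base might look like in boto3; the fleet role ARN, AMI, and subnet ID are placeholders, and this does not reuse an existing Elastic Beanstalk instance.

```python
import boto3

ec2 = boto3.client('ec2')

ec2.request_spot_fleet(SpotFleetRequestConfig={
    'IamFleetRole': 'arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-tagging-role',  # placeholder
    'TargetCapacity': 4,              # total capacity (spot + on-demand)
    'OnDemandTargetCapacity': 1,      # always keep one on-demand instance running
    'AllocationStrategy': 'lowestPrice',
    'LaunchSpecifications': [{
        'ImageId': 'ami-0123456789abcdef0',       # placeholder AMI
        'InstanceType': 'c4.large',
        'SubnetId': 'subnet-0123456789abcdef0',   # placeholder subnet
    }],
})
```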
There's a third-party solution called Spot.io that not only replaces the spot instance with an on-demand instance in a scenario like the one you describe, but also has an algorithm that anticipates the interruption event and stands up an on-demand instance so it is ready before the interruption occurs.

Why can we use Spot Instances for an EMR cluster in AWS?

I came across this question in my AWS practice and would like to post it here for further discussion:
Your company is planning on using the EMR service available in AWS for running their big data framework and wants to minimize the cost of running the EMR service. Which of the following could help achieve this?
Options:
A. Running the EMR cluster in a dedicated VPC
B. Choosing Spot Instances for the underlying nodes
C. Choosing On-Demand Instances for the underlying nodes
D. Disable automated backups
Correct Answer
B. Choosing Spot Instances for the underlying nodes
Question:
Quoted from the AWS documentation: "When you use Spot Instances, you must be prepared for interruptions."
My understanding of the EMR service is that it requires resources to complete the job; if, say, a MapReduce job is not given enough resources, the job will fail.
Spot Instances, though the cost is low, do not guarantee availability. AWS states this very clearly (quoted here from the same page):
If your maximum price exceeds the current Spot price for the specified instance, and capacity is available, your request is fulfilled immediately.
Note: "capacity is available", in another word, if capacity is NOT available, your request won't get fulfilled.
I think On-Demand instances is what should be chosen for the underlying nodes, get the job is more important than saving cost, it is meaningless if the job cannot be done.
AWS certification exam keeps throwing these kinds of googly.
Since it's not mentioned that the company doesn't want any interruptions, Spot instance is the right answer to minimize cost.
And from my experience Spot gives up to 80% discount compared to that of on-demand cost.

Reduce price on AWS (EC2 and spot instances)

I have a queue of jobs and running AWS EC2 instances which process the jobs. We have an Auto Scaling group for each c4.* instance type, in spot and on-demand versions.
Each instance has a power, which is a number equal to the instance's number of CPUs (for example, c4.large has power=2 since it has 2 CPUs).
The exact power we need is simply calculated from the number of jobs in the queue.
I would like to implement an algorithm which would periodically check the number of jobs in the queue and change the desired capacity of the particular Auto Scaling groups via the AWS SDK, to save as much money as possible while maintaining the total power of instances needed to keep the jobs processed.
Especially:
I prefer spot instances to on-demand since they are cheaper
EC2 instances are charged per hour; we would like to turn off an instance only in the very last minutes of its one-hour billing period.
We would like to replace on-demand instances with spot instances when possible. So, at minute 55 increase the spot group, at minute 58 check that the new spot instance is running, and if yes, decrease the on-demand group.
We would like to replace spot instances with on-demand instances if the bid would be too high: just turn off the spot one and turn on an on-demand one.
The problem seems really difficult to handle. Does anybody have any experience with this, or a similar solution implemented?
You could certainly write your own code to do this, effectively telling your Auto Scaling groups when to add/remove instances.
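A rough sketch of that periodic check with boto3, assuming the job queue is SQS and that one queued job needs one unit of power; the queue URL, Auto Scaling group names, and power-per-instance value are placeholders, and the end-of-billing-hour replacement logic is left out.

```python
import math

import boto3

sqs = boto3.client('sqs')
autoscaling = boto3.client('autoscaling')

QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/jobs'  # placeholder
SPOT_GROUP = 'c4-large-spot'          # placeholder Auto Scaling group names
ONDEMAND_GROUP = 'c4-large-ondemand'
POWER_PER_INSTANCE = 2                # c4.large has 2 CPUs


def required_power():
    """One queued job = one unit of power (assumption stated above)."""
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=['ApproximateNumberOfMessages'])
    return int(attrs['Attributes']['ApproximateNumberOfMessages'])


def rebalance():
    """Ask the cheaper spot group for all the capacity currently needed."""
    needed_instances = math.ceil(required_power() / POWER_PER_INSTANCE)
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=SPOT_GROUP, DesiredCapacity=needed_instances)
    # A separate check (run near minute 55 of the billing hour) would confirm
    # the spot instances are running before shrinking ONDEMAND_GROUP.
```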
Also, please note that a good strategy for lowering costs with Spot Instances is to appreciate that the price for a spot instance varies by:
Region
Availability Zone
Instance Type
So, if the spot price for a c4.xlarge goes up in one AZ, it might still be the same cost in another AZ. Also, the price of a c4.2xlarge might then be lower than a c4.xlarge, with twice the power.
Therefore, you should aim to diversify your spot instances across multiple AZs and multiple instance types. This means that spot price changes will impact only a small portion of your fleet rather than all of it at once.
You could use Spot Fleet to assist with this, or even third-party products such as SpotInst.
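As a hedged illustration of that diversification, a Spot Fleet request can carry one launch specification per instance type and subnet (AZ) combination so a price spike in one pool only affects part of the fleet; the role ARN, AMI, subnet IDs, and capacity weights below are placeholders.

```python
import boto3

ec2 = boto3.client('ec2')

# One launch specification per (instance type, subnet/AZ) pair.
launch_specs = [
    {'ImageId': 'ami-0123456789abcdef0',   # placeholder AMI
     'InstanceType': itype,
     'SubnetId': subnet,                   # placeholder subnets, one per AZ
     'WeightedCapacity': weight}
    for itype, weight in [('c4.xlarge', 4), ('c4.2xlarge', 8)]
    for subnet in ['subnet-aaa111', 'subnet-bbb222']
]

ec2.request_spot_fleet(SpotFleetRequestConfig={
    'IamFleetRole': 'arn:aws:iam::123456789012:role/aws-ec2-spot-fleet-tagging-role',  # placeholder
    'AllocationStrategy': 'lowestPrice',   # fill from the cheapest pools first
    'TargetCapacity': 16,                  # expressed in weighted capacity units
    'LaunchSpecifications': launch_specs,
})
```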
It's also worth looking at AWS Batch (not currently available in every region), which is designed to intelligently provide capacity for batch jobs.
Auto Scaling groups allow you to use alarms and metrics that are defined outside of the Auto Scaling group.
If you are using SQS, you should be able to set up an alarm on your SQS queue and use that to scale your scaling group up and down.
If you are using a custom queue system, you can push metrics to CloudWatch to create a similar alarm.
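For a custom (non-SQS) queue system, publishing the queue depth as a custom CloudWatch metric might look roughly like this; the namespace and metric name are placeholders.

```python
import boto3

cloudwatch = boto3.client('cloudwatch')


def publish_queue_depth(depth):
    """Push the current queue depth so an alarm/scaling policy can act on it."""
    cloudwatch.put_metric_data(
        Namespace='MyApp/Jobs',            # placeholder namespace
        MetricData=[{
            'MetricName': 'PendingJobs',   # placeholder metric name
            'Value': depth,
            'Unit': 'Count',
        }])
```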
You can control how often scaling actions occur, but it may be difficult to get the run time to exactly one hour.

EC2 spot pricing graph stopped working

A long time ago, there was the most useful spot price comparison graph that I have ever used, but it stopped working; as far as I know, the creator ran out of time to maintain it. The website is still active at ec2price.com and the code is on GitHub. Does anyone know if anyone has replicated this, or any way to do it myself? As I said, it was really useful for deciding which spot instance to choose.
You can see this information in the EC2 console by browsing to Spot Requests and selecting the Pricing History button.
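If you want to rebuild the graph yourself, the same data is available programmatically via describe_spot_price_history; the instance types and time window below are just examples.

```python
from datetime import datetime, timedelta

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')

history = ec2.describe_spot_price_history(
    InstanceTypes=['c4.large', 'c4.xlarge'],
    ProductDescriptions=['Linux/UNIX'],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow())

for point in history['SpotPriceHistory']:
    print(point['Timestamp'], point['AvailabilityZone'],
          point['InstanceType'], point['SpotPrice'])
```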
If you want to select the cheapest ec2 instance type automatically you can create a spot fleet request; select all the instance types you might want to use and an allocation strategy of lowestPrice. Deploy this to a VPC with a subnet in all availability zones in your region to get the lowest price possible.
Besides the code, somebody must pay to maintain the server that polls the information.
Check out How Spot Fleet Works. Spot Fleet is way better than price monitoring. You can make a request based on pricing for a fleet of instance types rather than limiting yourself to a specific instance type. You can kick-start instances from a large fleet based on a maximum instance price or vCPU price.
If you are using a spot-ready batch application, after you submit your bid for a fleet of different instance types and set the maximum per-vCPU price, Spot Fleet will automatically launch the available instances with the best price. So you don't need to compete for limited popular instances (for example, c4.* spot instances are scarce in most regions).
This is a win-win for both AWS and the customer, as AWS is able to spread usage to underutilized instance types. IMHO, there is no point in repeatedly raising the bid for a particular instance type if those instances are exhausted while there are still many idle AWS instances in alternate zones up for grabs.