We are running an EMR cluster with Spot Instances as task nodes. The cluster executes Spark jobs that sometimes run for several hours. Spot Instance interruptions can cause a Spark job to fail, which then requires us to restart the job entirely.
I can see that there is some basic information on the "Frequency of interruption" in the AWS Spot Advisor. However, this data seems very generic: I can't see historic trends, and it also lacks the probability of interruption based on how long the Spot Instance has been running (which should have a significant impact on that probability).
Is this data available somewhere? Or are there other data points that can be used as proxy?
I found this GitHub issue, which provides a link to this JSON file in the Spot Advisor S3 bucket that includes interruption rates:
https://spot-bid-advisor.s3.amazonaws.com/spot-advisor-data.json
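To make the file easier to explore, here is a minimal Python sketch that downloads it and looks up the interruption-rate bucket for one instance type. AWS does not document this file, so the key names used below ("spot_advisor", "ranges", "r", "s") are assumptions based on the file's structure at the time I checked it, and may change:

```python
import json
import urllib.request

URL = "https://spot-bid-advisor.s3.amazonaws.com/spot-advisor-data.json"

with urllib.request.urlopen(URL) as resp:
    data = json.load(resp)

# "ranges" maps the bucket index stored under "r" to a human-readable
# label such as "<5%" or "5-10%" (assumed schema).
labels = {r["index"]: r["label"] for r in data["ranges"]}

# Look up one instance type in one region for Linux (assumed schema).
entry = data["spot_advisor"]["us-east-1"]["Linux"]["m5.xlarge"]
print("interruption rate bucket:", labels[entry["r"]])
print("savings vs On-Demand:", entry["s"], "%")
```

Note that the file only exposes coarse rate buckets per instance type and region; it still does not give historic trends or interruption probability as a function of instance runtime.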
Related
I want to use EMR Spot Instances to cut down my Redshift and AWS Glue costs, but after reading about them I have some questions. If I am running a 30-minute job, how likely is it to get interrupted? How often are these Spot Instances taken away while running a job, and if they are taken away, how can I manage my job to re-run?
My focus is mostly on Spark jobs.
Opinion-based, but here goes.
Excellent read: https://aws.amazon.com/blogs/big-data/spark-enhancements-for-elasticity-and-resiliency-on-amazon-emr/
Basically, AWS allows you to use Spot Instances and recover gracefully thanks to integration with YARN's decommissioning mechanism. You need to code nothing in your Spark app.
That said, if you want to run using Spot Instances, you can still get your output, but it may take a while longer.
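As a sketch of what this looks like in practice, here is how you might add a Spot task instance group to a running cluster with boto3 (the cluster ID below is a placeholder). Task nodes hold no HDFS data, so an interruption only costs recomputed work:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Add a Spot task instance group to an existing cluster.
# "j-XXXXXXXXXXXXX" is a placeholder cluster ID.
emr.add_instance_groups(
    JobFlowId="j-XXXXXXXXXXXXX",
    InstanceGroups=[
        {
            "Name": "spot-task-nodes",
            "Market": "SPOT",
            "InstanceRole": "TASK",
            "InstanceType": "m5.xlarge",
            "InstanceCount": 10,
            # Omitting BidPrice caps the bid at the On-Demand price.
        }
    ],
)
```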
AWS Glue is serverless and hence has nothing to do with EMR. Redshift is also priced differently.
I have an EMR cluster that can scale up to a maximum of 10 SPOT nodes. When not being used, it defaults to 1 CORE node (and 1 MASTER), obviously to save costs. So in total it can scale up to a maximum of 11 nodes: 1 CORE + 10 SPOT.
When I run my spark job it takes a while to spin up the 10 SPOT nodes and my job ends up taking about 4hrs to complete.
I tried waiting until all the nodes were spun up, then canceled my job and immediately restarted it so that it could start using the max resources immediately, and my job took only around 3hrs to complete.
I have 2 questions:
1. Is there a way to make YARN spin up all the necessary resources before starting my job? I already specify the spark-submit parameters such as num-executors, executor-memory, executor-cores etc. during job submit.
2. I haven't done the cost analysis yet, but is it even worthwhile to do number 1 mentioned above? Does AWS charge for spin-up time, even when a job is not being run?
Would love to know your insights and suggestions.
Thank You
I am assuming you are using AWS managed scaling for this. If you can switch to custom scaling, you can set more aggressive scaling rules; you can also set the number of nodes to scale up or down by on each scaling event, which will help you converge faster to the required number of nodes.
The only downside to custom scaling is that it takes 5 minutes to trigger.
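As a sketch, a custom scaling rule can be attached to an instance group with boto3's put_auto_scaling_policy. The IDs and thresholds below are placeholders, and the metric choice (YARNMemoryAvailablePercentage) is just one reasonable option:

```python
import boto3

emr = boto3.client("emr")

# Add 3 task nodes whenever available YARN memory drops below 15%.
# Cluster and instance-group IDs are placeholders.
emr.put_auto_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",
    InstanceGroupId="ig-XXXXXXXXXXXX",
    AutoScalingPolicy={
        "Constraints": {"MinCapacity": 1, "MaxCapacity": 10},
        "Rules": [
            {
                "Name": "scale-out-on-low-memory",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 3,  # nodes added per trigger
                        "CoolDown": 300,
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Period": 300,
                        "Threshold": 15.0,
                        "Unit": "PERCENT",
                    }
                },
            }
        ],
    },
)
```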
Is there a way to make YARN spin up all the necessary resources before starting my job?
I do not know how to achieve this. But in my opinion, it is not worth doing: Spark is intelligent enough to do this for us.
It knows how to redistribute tasks when instances come up or go away in the cluster. There is a Spark configuration you should be aware of to achieve this: set spark.dynamicAllocation.enabled to true. There are some other relevant configurations that you can change or leave as they are.
For more detail, refer to the documentation for spark.dynamicAllocation.enabled.
Please see the documentation matching your Spark version; this link is for Spark 2.4.0.
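As an illustration, here is how those settings might look when building a PySpark session (the executor counts are illustrative and should match your cluster size):

```python
from pyspark.sql import SparkSession

# Dynamic allocation lets Spark grow and shrink its executor count as
# nodes join or leave the cluster. The external shuffle service keeps
# shuffle files available when executors are removed.
spark = (
    SparkSession.builder
    .appName("dynamic-allocation-example")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "40")  # illustrative cap
    .getOrCreate()
)
```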
Does AWS charge for spin up time, even when a job is not being run?
You get charged for every second of an instance that you use, with a one-minute minimum. It does not matter whether your job is running or not; even if the instances are idle in the cluster, you will have to pay for them.
Refer to these links for more detail:
EMR FAQ
EMR PRICING
Hope this gives you some idea about EMR pricing and the Spark configuration related to dynamic allocation.
I came across this question in my AWS practice and would like to post it here for further discussion:
Your company is planning on using the EMR service available in AWS for running their big data framework and wants to minimize the cost of running the EMR service. Which of the following could help achieve this?
Options:
A. Running the EMR cluster in a dedicated VPC
B. Choosing Spot Instances for the underlying nodes
C. Choosing On-Demand Instances for the underlying nodes
D. Disable automated backups
Correct Answer
B. Choosing Spot Instances for the underlying nodes
Question:
quoted from AWS document: When you use Spot Instances, you must be prepared for interruptions.
My understanding of the EMR service is that it requires resources to complete the job; if, say, a MapReduce job is not given enough resources, the job will fail.
Spot Instances, though low-cost, do not guarantee availability. AWS states this very clearly (quoted here from the same page):
If your maximum price exceeds the current Spot price for the specified instance, and capacity is available, your request is fulfilled immediately.
Note: "capacity is available", in another word, if capacity is NOT available, your request won't get fulfilled.
I think On-Demand Instances are what should be chosen for the underlying nodes; getting the job done is more important than saving cost, and the saving is meaningless if the job cannot be done.
The AWS certification exam keeps throwing these kinds of googlies.
Since it's not mentioned that the company doesn't want any interruptions, Spot Instances are the right answer to minimize cost.
And in my experience, Spot gives up to an 80% discount compared to the On-Demand cost.
I have an AWS Redshift cluster (dc2.8xlarge) and I am currently paying a huge bill each month for running the cluster 24/7.
Is there a way I can automate the cluster uptime so that the cluster runs only during the day? I would like to stop the cluster at 8 PM in the evening and start it again at 8 AM in the morning.
Update: Stop/Start is now available. See: Amazon Redshift launches pause and resume
Amazon Redshift does not have a Start/Stop concept. However, there are a few options...
You could resize the cluster so that it is a lower-cost. A Redshift Cluster is sized for Compute and for Storage. You could reduce the number of nodes as long as you retain enough nodes for your Storage needs.
Also, Amazon Redshift has introduced RA3 nodes with managed storage, enabling independent compute and storage scaling, which means you might be able to scale down to a single node. (This is a new node type; I'm not sure how it works.)
Another option is to take a Snapshot and Shutdown the cluster. This will result in no costs for the cluster (but the Snapshot will be charged). Then, create a new cluster from the Snapshot when you want the cluster again.
Scheduling the above can be done in Amazon CloudWatch Events, which can trigger an AWS Lambda function. Within the function, you can make the necessary API calls to the Amazon Redshift service.
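As a sketch of such a Lambda function (the cluster and snapshot identifiers are placeholders; in practice you would generate a unique snapshot name per shutdown):

```python
import boto3

redshift = boto3.client("redshift")

def lambda_handler(event, context):
    # Triggered by two scheduled CloudWatch Events rules whose payloads
    # carry {"action": "stop"} or {"action": "start"} (placeholder shape).
    if event["action"] == "stop":
        # Shut down by deleting the cluster with a final snapshot.
        redshift.delete_cluster(
            ClusterIdentifier="my-cluster",
            SkipFinalClusterSnapshot=False,
            FinalClusterSnapshotIdentifier="my-cluster-nightly",  # placeholder
        )
    else:
        # Recreate the cluster from the snapshot taken at shutdown.
        redshift.restore_from_cluster_snapshot(
            ClusterIdentifier="my-cluster",
            SnapshotIdentifier="my-cluster-nightly",
        )
```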
If you are concerned with the general cost of your cluster, you might want to downsize from the dc2.8xlarge. You could either use multiple dc2.large nodes, or even consider a move to ds2.xlarge, which has a lower cost per TB of data stored.
Good news :)
We are now able to pause and resume the Redshift cluster (from both the Console and the CLI).
check out the link:
https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/
Now we can pause and resume an AWS Redshift cluster.
We can also schedule the pause and the resume, which is a very important feature to check on the costs.
Link: https://aws.amazon.com/blogs/big-data/lower-your-costs-with-the-new-pause-and-resume-actions-on-amazon-redshift/
This will help you automate the cluster uptime and downtime, so that the cluster runs during the day, is paused automatically at a specific time in the evening, and starts again automatically in the morning.
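As a sketch, the schedules can also be created programmatically with boto3's create_scheduled_action; the action names, cluster identifier, IAM role ARN, and cron expressions below are placeholders:

```python
import boto3

redshift = boto3.client("redshift")

# Pause the cluster every evening at 20:00 UTC...
redshift.create_scheduled_action(
    ScheduledActionName="pause-nightly",
    TargetAction={"PauseCluster": {"ClusterIdentifier": "my-cluster"}},
    Schedule="cron(0 20 * * ? *)",
    IamRole="arn:aws:iam::123456789012:role/RedshiftScheduler",
)

# ...and resume it every morning at 08:00 UTC.
redshift.create_scheduled_action(
    ScheduledActionName="resume-morning",
    TargetAction={"ResumeCluster": {"ClusterIdentifier": "my-cluster"}},
    Schedule="cron(0 8 * * ? *)",
    IamRole="arn:aws:iam::123456789012:role/RedshiftScheduler",
)
```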
It's pretty easy to use the open-source Cloud Custodian (https://cloudcustodian.io) to automate night-time/weekend off-hours on Redshift and other AWS resources.
I would like to know if I can use Auto Scaling to automatically scale Amazon EC2 capacity up or down according to CPU utilization with Elastic MapReduce.
For example, I start a MapReduce job with only 1 instance, but if this instance reaches, say, 50% utilization, I want the created Auto Scaling group to start a new instance. Is this possible?
Do you know if this is possible? Or, because Elastic MapReduce is "elastic", does it automatically start more instances when it needs them, without any configuration?
You need Qubole: http://www.qubole.com/blog/product/industrys-first-auto-scaling-hadoop-clusters/
We have never seen any of our users/customers use vanilla auto-scaling successfully with Hadoop. Hadoop is stateful: nodes hold HDFS data and intermediate outputs. Deleting nodes based on CPU/memory just doesn't work. Adding nodes needs sophistication - this isn't a web site. One needs to look at the sizes of the jobs submitted and the speed at which they are completing.
We run the largest Hadoop clusters, easily, on AWS (for our customers). And they auto-scale all the time. And they use spot instances. And it costs the same as EMR.
No, Auto Scaling cannot be used with Amazon Elastic MapReduce (EMR).
It is possible to scale EMR via API or Command-Line calls, adding and removing Task Nodes (which do not host HDFS storage). Note that it is not possible to remove Core Nodes (because they host HDFS storage, and removing nodes could lead to lost data). In fact, this is the only difference between Core and Task nodes.
It is also possible to change the number of nodes from within an EMR "Step". Steps are executed sequentially, so the cluster could be made larger prior to a step requiring heavy processing, and could be reduced in size in a subsequent step.
From the EMR Developer Guide:
You can have a different number of slave nodes for each cluster step. You can also add a step to a running cluster to modify the number of slave nodes. Because all steps are guaranteed to run sequentially by default, you can specify the number of running slave nodes for any step.
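As a sketch, resizing a running cluster from code is a single API call with boto3 (the instance-group ID below is a placeholder):

```python
import boto3

emr = boto3.client("emr")

# Grow the task instance group to 10 nodes before a heavy step;
# a later call with a smaller count shrinks it again.
emr.modify_instance_groups(
    InstanceGroups=[
        {"InstanceGroupId": "ig-XXXXXXXXXXXX", "InstanceCount": 10},
    ]
)
```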
CPU would not be a good metric on which to base scaling of an EMR cluster, since Hadoop will keep all nodes as busy as possible when a job is running. A better metric would be the number of jobs waiting, so that they can finish quicker.
See also:
Stackoverflow: Can we add more Amazon Elastic Mapreduce instances into an existing Amazon Elastic Mapreduce instances?
Stackoverflow: Can Amazon Auto Scaling Service work with Elastic Map Reduce Service?