I am new to AWS EMR, and I need to scale my task nodes up and down automatically based on usage. I am thinking of adding an SNS event on CloudWatch alarms for AppsPending (scale up) and IsIdle (scale down).
Am I thinking correctly?
Is there any good documentation on this?
Please advise.

There is no built-in capability within Amazon EMR to automatically scale the cluster size based on a metric.
One method is to add/remove Task Nodes as a Job Step. This does not automatically scale based upon demand, but can scale when you know that a large job step is required.
For example, if the cluster is performing a batch of several job steps and one of the steps requires more servers:
1. Create a job step that adds Task Nodes
2. Create a job step to perform the work
3. Create a job step to remove excess Task Nodes
To be truly automatic, you would need to monitor some combination of metrics that would indicate heavy load, and then add/remove nodes accordingly. The choice of metrics, however, would depend upon your particular workloads.
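As a rough illustration of the metric-driven approach, here is a minimal sketch of the decision logic only. It assumes the two signals mentioned in the question (a count of pending applications and an idle flag) have already been fetched. The function name, the `apps_per_node` ratio and the node limits are hypothetical values chosen for the example, and the sketch does not call any AWS API.

```python
import math


def desired_task_nodes(current_nodes, apps_pending, is_idle,
                       min_nodes=0, max_nodes=10, apps_per_node=2):
    """Return the task-node count to request for one evaluation cycle.

    All thresholds are illustrative assumptions, not EMR defaults.
    """
    if apps_pending > 0:
        # Add enough nodes to absorb the backlog, capped at max_nodes.
        extra = math.ceil(apps_pending / apps_per_node)
        return min(max_nodes, current_nodes + extra)
    if is_idle:
        # Nothing queued and nothing running: fall back to the floor.
        return min_nodes
    # Work is running but nothing is waiting: leave the cluster alone.
    return current_nodes
```

The returned count would then be applied through whichever EMR API or CLI call you use to resize the task instance group. As noted above, the right metrics and ratios depend on your workload.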
Another option is to fire up a cluster for specific jobs, then terminate the cluster when the job is finished.

You could take a look at Themis, an EMR autoscaling framework developed at Atlassian.
Current features include reactive autoscaling (based on current usage) as well as proactive autoscaling (based on predefined schedules).
The tool also comes with a simple Web UI and is very easy to configure.


AWS ECS app fast auto scaling for video encoding. What is the best way?

I am currently running a video encoding application on ECS but auto scaling is my biggest problem.
Users start live video encoding jobs from a front end. Once a job is placed, it is added as a Redis Queue (RQ) job that runs on an ECS task placed on a c5d.large instance using ffmpeg.
Autoscaling is currently based on alarms. If CPU exceeds a set percentage, a new instance and task are spawned. If CPU is low, instances are checked and destroyed if no jobs are running on them.
This is not a bad solution, but it feels clunky and slow. If a user wants to start two jobs one right after the other, it takes a couple of minutes for the instance to spawn and the task to be placed (even using warm groups).
Plus, CloudWatch alarms take a while to refresh and are not a very reliable measure of the work being done (a video encoding at 720p uses less CPU than one at 1080p, which throws off all my alarm settings).
Is there a better solution that someone can guide me to that allows for fast and precise autoscaling other than relying on cloudwatch alarms? I am tempted to try to create my own autoscaling system based on current executing jobs / workers and spawn/destroy instances directly calling the API from my code, but I'm hoping to find a better solution directly from within AWS.
I have this exact problem too. AWS already offers MediaConvert and Elastic Transcoder, but they are too expensive, so I decided to build my own. I started on Lambda (with the Serverless framework), where each job is a single function invocation, but I ran into the 15-minute function timeout, mostly because I'm re-encoding rather than copying codecs.
For scaling at this point, I would look at Kubernetes. This is the sort of problem that Kubernetes is intended to handle (dynamic resource scaling on demand). Kubernetes is rather non-trivial, but it is what the industry has mostly settled on, so there are probably a lot of reasons to just go that route. You could start with K3s (which I only learned about today) and move up to full Kubernetes when you are ready.
Since you're looking for a solution within AWS, you can try EKS, but I'm not completely sure what the best option would be.
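The job-count-based approach the question considers can be sketched as below. This is only a minimal illustration of the sizing arithmetic, assuming the number of queued and running jobs can be read from the queue. The function names, the `warm_spares` buffer and the `max_instances` cap are hypothetical, and nothing here launches or terminates instances.

```python
import math


def desired_instances(queued_jobs, running_jobs, jobs_per_instance=1,
                      warm_spares=1, max_instances=20):
    """Instances needed for all known jobs plus a small warm buffer."""
    needed = math.ceil((queued_jobs + running_jobs) / jobs_per_instance)
    return min(max_instances, needed + warm_spares)


def scaling_delta(current_instances, queued_jobs, running_jobs, **kwargs):
    """Positive: launch that many. Negative: that many may be retired."""
    target = desired_instances(queued_jobs, running_jobs, **kwargs)
    return target - current_instances
```

Sizing from the job count avoids the problem that 720p and 1080p encodes load the CPU differently. When the delta is negative, only instances with no running job should be retired, as the current setup already does.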

How to spin up all nodes in my EMR cluster before running my spark job

I have an EMR cluster that can scale up to a maximum of 10 SPOT nodes. When not being used it defaults to 1 CORE node (and 1 MASTER) to save costs obviously. So in total it can scale up to a maximum of 11 nodes 1 CORE + 10 SPOT.
When I run my spark job it takes a while to spin up the 10 SPOT nodes and my job ends up taking about 4hrs to complete.
I tried waiting until all the nodes were spun up, then canceled my job and immediately restarted it so that it can start using the max resources immediately, and my job took only around 3hrs to complete.
I have 2 questions:
1. Is there a way to make YARN spin up all the necessary resources before starting my job? I already specify the spark-submit parameters such as num-executors, executor-memory, executor-cores etc. during job submit.
2. I haven't done the cost analysis yet, but is it even worthwhile to do number 1 above? Does AWS charge for spin-up time, even when a job is not being run?
Would love to know your insights and suggestions.
Thank You
I am assuming you are using AWS managed scaling for this. If you can switch to custom scaling, you can set more aggressive scaling rules. You can also set the number of nodes to add or remove on each upscale and downscale, which will help you converge faster to the required number of nodes.
The only downside to custom scaling is that it will take 5 minutes to trigger.
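To see why the step size matters, here is a back-of-the-envelope sketch. It assumes one scaling action per 5-minute evaluation (the trigger delay mentioned above) and ignores instance boot time, cooldowns and Spot availability, so treat the result as a rough lower bound. The function name and parameters are hypothetical.

```python
import math


def minutes_to_converge(current_nodes, target_nodes, nodes_per_step,
                        minutes_per_evaluation=5):
    """Rough lower bound on time to reach target_nodes, ignoring boot time."""
    if nodes_per_step <= 0:
        raise ValueError("nodes_per_step must be positive")
    missing = max(0, target_nodes - current_nodes)
    steps = math.ceil(missing / nodes_per_step)
    return steps * minutes_per_evaluation
```

Under these assumptions, growing from 1 to 11 nodes one node at a time needs ten evaluations, while a step of 10 needs a single one.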
Is there a way to make YARN spin up all the necessary resources before starting my job?
I do not know how to achieve this, but in my opinion it is not worth doing. Spark is intelligent enough to handle this for us.
It knows how to redistribute tasks as instances join or leave the cluster. There is one Spark configuration you should be aware of to make this work.
Set spark.dynamicAllocation.enabled to true. There are some other related settings that you can change or leave at their defaults.
For more detail, refer to the documentation for spark.dynamicAllocation.enabled.
Please consult the documentation for your Spark version. This link is for Spark version 2.4.0.
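As an illustration, the dynamic allocation settings could be passed to spark-submit roughly as follows. This is a minimal sketch that only builds the argument list. The helper name and the executor limits are hypothetical, and the property names should be checked against the documentation for your Spark version.

```python
def dynamic_allocation_args(min_executors=1, max_executors=40):
    """Build spark-submit --conf flags that turn on dynamic allocation."""
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        # The external shuffle service lets executors be removed safely.
        "spark.shuffle.service.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args
```

The returned list can be appended to the rest of the spark-submit command line. A fixed num-executors setting interacts with dynamic allocation, so check how your Spark version treats the combination.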
Does AWS charge for spin up time, even when a job is not being run?
You are charged for every second that an instance is running, with a one-minute minimum. It does not matter whether your job is running or not. Even if the instances are idle in the cluster, you still pay for them.
Refer to these links for more detail:
Hope this gives you some idea about the EMR pricing and certain spark configuration related to the dynamic allocation.
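The per-second billing rule with a one-minute minimum described above can be worked through with a small sketch. The function names are hypothetical, and `hourly_rate` is a placeholder price you would replace with the actual rate for your instance type.

```python
def billed_seconds(uptime_seconds, minimum_seconds=60):
    """Per-second billing with a one-minute minimum per instance."""
    if uptime_seconds <= 0:
        return 0
    return max(minimum_seconds, uptime_seconds)


def instance_cost(uptime_seconds, hourly_rate):
    """Cost of one instance; hourly_rate is a hypothetical price."""
    return billed_seconds(uptime_seconds) * hourly_rate / 3600
```

Multiplying `instance_cost` by the number of nodes and the warm-up time gives an estimate of what it would cost to keep all nodes waiting before the job starts.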

Is there any way I can use preemptible instances for Dataflow jobs?

It's evident that preemptible instances are cheaper than non-preemptible instances. Around 400-500 Dataflow jobs run daily in my organisation's project. Some of these jobs are time-sensitive and others are not. Is there any way I could use preemptible instances for the non-time-sensitive jobs, which would reduce the overall cost of pipeline execution? Currently I'm running Dataflow jobs with the configuration specified below.
I'm not able to find any pipeline option for a preemptible instance setting.
Yes, it is possible to do so with Flexible Resource Scheduling in Cloud Dataflow (docs). Note that there are some things to consider:
Delayed execution: jobs are scheduled and not executed right away (you can see a new QUEUED status for your Dataflow jobs). They are run opportunistically when resources are available within a six-hour window. This makes FlexRS suitable to reduce cost for non-time-critical workloads. Also, be sure to validate your code before sending the job.
Batch jobs: as of now it only accepts batch jobs and requires autoscaling to be enabled, so you cannot set autoscalingAlgorithm=NONE.
Dataflow Shuffle: it needs to be enabled. When so, no data is stored on persistent disks attached to the VMs. This way, when a preemption happens and resources are claimed back there is no need to redistribute the data.
Regions: following from the previous item, only regions where Dataflow Shuffle is supported can be selected. The list is here, and turn-up for new regions will be announced in the release notes. As of now, the zone is chosen automatically within the region.
Machine types: FlexRS currently supports n1-standard-2 (default) and n1-highmem-16.
SDK: requires 2.12.0 or newer for Java or Python.
Quota: quota is reserved upfront (i.e. queued jobs also consume quota).
To run it, use --flexRSGoal=COST_OPTIMIZED and make sure that the rest of the parameters conform to the FlexRS requirements.
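The constraints listed above can be checked before submitting a job. The sketch below is a hypothetical pre-flight check, not part of any Dataflow SDK. It encodes only the requirements stated in this answer (batch only, autoscaling enabled, the two machine types, SDK 2.12.0 or newer), which may have changed since it was written.

```python
SUPPORTED_MACHINE_TYPES = {"n1-standard-2", "n1-highmem-16"}
MIN_SDK_VERSION = (2, 12, 0)


def flexrs_problems(is_batch, autoscaling_algorithm, machine_type, sdk_version):
    """Return a list of reasons a job would not meet the FlexRS constraints."""
    problems = []
    if not is_batch:
        problems.append("FlexRS only accepts batch jobs")
    if autoscaling_algorithm == "NONE":
        problems.append("autoscaling must stay enabled")
    if machine_type not in SUPPORTED_MACHINE_TYPES:
        problems.append(f"unsupported machine type: {machine_type}")
    # Compare versions numerically so that 2.9.0 sorts before 2.12.0.
    version = tuple(int(part) for part in sdk_version.split("."))
    if version < MIN_SDK_VERSION:
        problems.append(f"SDK {sdk_version} is older than 2.12.0")
    return problems
```

An empty list means the job satisfies the constraints encoded here. It says nothing about quota or region support, which still need to be checked separately.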
A uniform discount rate is applied to FlexRS jobs. You can compare pricing details at the following link.
Note that you might see a Beta disclaimer in the non-English documentation but, as clarified in the release notes, it's Generally Available.

Is it possible to use Auto Scaling with Elastic MapReduce?

I would like to know if I can use Auto Scaling to automatically scale Amazon EC2 capacity up or down according to CPU utilization with Elastic MapReduce.
For example, I start a MapReduce job with only 1 instance, but if this instance reaches 50% utilization, I want the Auto Scaling group I created to start a new instance. Is this possible?
Or, because Elastic MapReduce is "elastic", does it automatically start more instances when needed, without any configuration?
You need Qubole:
We have never seen any of our users/customers use vanilla auto-scaling successfully with Hadoop. Hadoop is stateful. Nodes hold HDFS data and intermediate outputs. Deleting nodes based on cpu/memory just doesn't work. Adding nodes needs sophistication - this isn't a web site. One needs to look at the sizes of jobs submitted and the speed at which they are completing.
We run the largest Hadoop clusters, easily, on AWS (for our customers). And they auto-scale all the time. And they use spot instances. And it costs the same as EMR.
No, Auto Scaling cannot be used with Amazon Elastic MapReduce (EMR).
It is possible to scale EMR via API or Command-Line calls, adding and removing Task Nodes (which do not host HDFS storage). Note that it is not possible to remove Core Nodes (because they host HDFS storage, and removing nodes could lead to lost data). In fact, this is the only difference between Core and Task nodes.
It is also possible to change the number of nodes from within an EMR "Step". Steps are executed sequentially, so the cluster could be made larger prior to a step requiring heavy processing, and could be reduced in size in a subsequent step.
From the EMR Developer Guide:
You can have a different number of slave nodes for each cluster step. You can also add a step to a running cluster to modify the number of slave nodes. Because all steps are guaranteed to run sequentially by default, you can specify the number of running slave nodes for any step.
CPU would not be a good metric on which to base scaling of an EMR cluster, since Hadoop will keep all nodes as busy as possible when a job is running. A better metric would be the number of jobs waiting, so that they can finish quicker.
See also:
Stackoverflow: Can we add more Amazon Elastic Mapreduce instances into an existing Amazon Elastic Mapreduce instances?
Stackoverflow: Can Amazon Auto Scaling Service work with Elastic Map Reduce Service?

High availability periodic task (cron) on AWS

What is the recommended way/pattern for setting up a High Availability (multiple Availability Zones) periodic task (probably triggered using Cron) on AWS?
I would like to have the software installed on multiple EC2 instances in multiple Availability Zones but only have the task run on a single instance at a time. It doesn't matter which instance.
Before moving to AWS, we used to use database locking in a MySQL instance - only the instance that successfully creates a lock would run.
But there must be a better way on AWS? Particularly if there is no requirement for a database.
Since I first asked this question, it is now possible to use CloudWatch Events to schedule periodic events. Events can be:
A fixed rate of N minutes/hours/days
Cron expression
The targets can be:
An AWS Lambda function
Post to an SNS topic
Write to an SQS queue
Write to a Kinesis stream
Perform an action on EC2/EBS instance
SQS could then be used to notify a single instance in a cluster of machines in multiple availability zones to perform an action.
There is more information here:
However, it does not include a resilience/availability statement about what happens if an Availability Zone goes down.
One solution that has been suggested to me is to use an Auto Scaling Group with the maximum and minimum number of instances set to 1. This means that if an Availability Zone goes offline, the ASG will launch a new instance in another zone.
This technique was briefly covered in the Architecting on AWS training course, but I don't know how reliable it is.
Amazon just released the solution to your problem: Beanstalk worker tier periodic tasks:
It basically relies on a yaml file listing cron schedules calling the API you want:
To invoke periodic tasks, your application source bundle must include a cron.yaml file at the root level. The file must contain information about the periodic tasks you want to schedule. Specify this information using standard crontab syntax.
It really depends on your requirements.
You could post your tasks to an SQS queue and have your instances (possibly an Auto Scaling group spread across different zones) poll that queue. SQS's at-least-once (and generally only-once) delivery semantics could be a problem here if it is critical that tasks are executed exactly once. If that's the case, you could use a DynamoDB table and conditional writes. Or, if you're after a full-fledged fault-tolerant solution, you might give Airbnb's Chronos a try.
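The conditional-write idea can be illustrated with a minimal sketch. The `ConditionalStore` class is a hypothetical in-memory stand-in for a table that supports "write only if the key does not exist". In a real deployment that role would be played by a DynamoDB conditional write, and this class does not coordinate across machines.

```python
class ConditionalStore:
    """In-memory stand-in for a table that supports conditional writes."""

    def __init__(self):
        self._items = {}

    def put_if_absent(self, key, value):
        """Write only if the key does not exist; report whether it was written."""
        if key in self._items:
            return False
        self._items[key] = value
        return True


def try_claim(store, task_name, scheduled_time, worker_id):
    """Claim one scheduled run; only the first caller gets True."""
    run_key = f"{task_name}#{scheduled_time}"
    return store.put_if_absent(run_key, worker_id)
```

Each instance would call `try_claim` when it receives the task and run the work only if the call returns True. Keying on the scheduled time lets a duplicate delivery of the same run be rejected while the next run is still accepted.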