Running steps of EMR in parallel

I am running Spark jobs on an EMR cluster. The issue I am facing is that all the EMR jobs I trigger execute as steps, one after another (in a queue).
Is there any way to make them run in parallel?
If not, is there any alternative?

Elastic MapReduce comes by default with a very "step"-oriented YARN setup: a single CapacityScheduler queue with 100% of the cluster resources assigned to it. Because of this configuration, any time you submit a job to an EMR cluster, YARN maximizes cluster usage for that single job, granting it all available resources until it finishes.
Running multiple concurrent jobs in an EMR cluster (or any other YARN-based Hadoop cluster, in fact) requires a proper YARN setup with multiple queues in order to grant resources to each job. YARN's documentation covers all of the CapacityScheduler features quite well, and it is simpler than it sounds.
YARN's FairScheduler is quite popular, but it uses a different approach and may be a bit more difficult to configure depending on your needs. In the simplest scenario, where you have a single Fair queue, YARN tries to grant containers to waiting jobs as soon as they are freed by running jobs, ensuring that every job submitted to the cluster gets at least a fraction of the compute resources as soon as they become available.

If you are concerned about YARN jobs (submitted by Spark) waiting in a queue, there are multiple ways to run them in parallel.
By default, EMR uses the YARN CapacityScheduler with the DefaultResourceCalculator and a single DEFAULT queue to which all YARN jobs are submitted. Since there is only one queue, the number of YARN jobs that you can RUN (not just submit) in parallel really depends on how many AMs, mappers and reducers your EMR cluster supports in parallel.
For example: say you have a cluster that can run at most 10 mappers in parallel (see AWS EMR Parallel Mappers?).
Suppose you submit two map-only jobs, each requiring 10 mappers, one after another. The first job takes up all of the mapper container capacity and runs, while the second waits in the queue for containers to free up. The behavior is the same for AMs and reducers.
Now, to make the jobs run in parallel in spite of that limit on the number of containers the cluster supports:
Keeping the CapacityScheduler, you can create multiple queues, configuring a percentage of capacity along with a maximum capacity for each queue, so that a job in the first queue cannot use up all containers even if it needs them. You can then submit your second job to the second queue, which has its own pre-determined capacity.
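As a rough sketch of that multi-queue setup, EMR lets you pass CapacityScheduler properties through a configuration classification at cluster launch. The queue names and percentages below are illustrative, not a recommendation:

```shell
# Sketch: split the single root queue into two CapacityScheduler queues via
# EMR's "capacity-scheduler" configuration classification. Pass the file with
# `aws emr create-cluster ... --configurations file://queues.json`.
cat > queues.json <<'EOF'
[
  {
    "Classification": "capacity-scheduler",
    "Properties": {
      "yarn.scheduler.capacity.root.queues": "default,adhoc",
      "yarn.scheduler.capacity.root.default.capacity": "60",
      "yarn.scheduler.capacity.root.default.maximum-capacity": "80",
      "yarn.scheduler.capacity.root.adhoc.capacity": "40",
      "yarn.scheduler.capacity.root.adhoc.maximum-capacity": "80"
    }
  }
]
EOF
```

A Spark job can then be sent to the second queue with `spark-submit --queue adhoc ...` (or `--conf spark.yarn.queue=adhoc`).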
Alternatively, you can use the FairScheduler by configuring yarn-site.xml. The FairScheduler allows you to configure queues and share resources across them fairly. You might also use the PREEMPTION option of the FairScheduler.
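On EMR, switching to the FairScheduler can likewise be done through a configuration classification rather than editing yarn-site.xml by hand. A minimal sketch (the scheduler class and preemption property are standard Hadoop ones; your actual queue layout goes in a fair-scheduler allocation file):

```shell
# Sketch: switch YARN to the FairScheduler (with preemption) through EMR's
# "yarn-site" configuration classification, passed at cluster creation.
cat > fair-scheduler.json <<'EOF'
[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.resourcemanager.scheduler.class": "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler",
      "yarn.scheduler.fair.preemption": "true"
    }
  }
]
EOF
```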
Note that the choice of what option to go with - really depends on your use-case and business needs. It is important to learn about all options and possible impact.
https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781491901687/ch04.html

Amazon EMR now supports the ability to run multiple steps in parallel. The number of steps allowed to run at once is configurable and can be set when a cluster is launched and at any time after the cluster has started.
Please see this announcement for more details: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-emr-now-allows-you-to-run-multiple-steps-in-parallel-cancel-running-steps-and-integrate-with-aws-step-functions/.
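As a sketch of how this looks with the AWS CLI (the cluster ID and instance settings are placeholders; step concurrency requires EMR release 5.28.0 or later):

```shell
# Sketch: allow up to 10 steps to run concurrently, set at launch...
aws emr create-cluster \
  --name "parallel-steps-demo" \
  --release-label emr-5.28.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles \
  --step-concurrency-level 10

# ...or changed on a running cluster:
aws emr modify-cluster --cluster-id j-XXXXXXXXXXXXX --step-concurrency-level 10
```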


Related

AWS Batch permits approx 25 concurrent jobs in array configuration while compute environment allows using 256 CPU

I am running a job array on AWS Batch using a Fargate Spot compute environment.
The main goal is to do some work as quickly as possible. So, when I run 100 jobs I expect all of them to run simultaneously.
But only approximately 25 of them start immediately; the rest of the jobs wait in the RUNNABLE state.
The jobs run in a compute environment with a maximum of 256 CPUs. Each job uses 1 CPU or less.
I haven't found any limits or quotas that can influence the process of running jobs.
What could be the cause?
I've talked with AWS Support and they advised me not to use Fargate when I need to process a lot of jobs as quickly as possible.
For large-scale job processing, an On-Demand solution is recommended.
So, after changing the provisioning model to On-Demand, the number of concurrent jobs grew up to the CPU limit set in the compute environment, which was what I needed.
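For reference, the change amounts to using an EC2 On-Demand compute environment instead of FARGATE_SPOT. A sketch (every name, subnet, security group, and role here is a placeholder):

```shell
# Sketch: a managed EC2 On-Demand compute environment with a 256 vCPU cap,
# so job concurrency can climb to the configured limit.
aws batch create-compute-environment \
  --compute-environment-name ondemand-256 \
  --type MANAGED \
  --compute-resources type=EC2,minvCpus=0,maxvCpus=256,desiredvCpus=0,instanceTypes=optimal,subnets=subnet-XXXX,securityGroupIds=sg-XXXX,instanceRole=ecsInstanceRole \
  --service-role AWSBatchServiceRole
```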

Limit concurrency of AWS Ecs tasks

I have deployed a Selenium script on ECS Fargate which communicates with my server through an API. Normally almost 300 scripts run in parallel and bombard my server with API requests. I am facing a Net::Read::Timeout error because the server is unable to respond in the given time frame. How can I limit the number of ECS tasks running in parallel?
For example, if I run 300 scripts, 50 scripts should run in parallel and the remaining 250 scripts should stay in a pending state.
I think for your use case, you should have a look at AWS Batch, which supports Docker jobs, and job queues.
This question was about limiting concurrency on AWS Batch: AWS batch - how to limit number of concurrent jobs
Edit: btw, the same strategy could maybe be applied to ECS, as in assigning your scripts to only a few instances, so that more can't be provisioned until the previous ones have finished.
I am unclear how your script works and there may be many ways to peel this onion, but one way that would be easier to implement, assuming your tasks/scripts are long-running, is to create an ECS service and modify the number of tasks in it. You can start with a service that has 50 tasks and then update the service to 20 or 300 or any number you want. The service will deploy or remove tasks depending on the task count parameter you configure.
This of course assumes the tasks (and the script) run infinitely. If your script is such that it starts and it ends at some point (in a batch sort of way) then probably launching them with either AWS Batch or Step Functions would be a better approach.
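The service approach above can be sketched with one CLI call (cluster and service names are placeholders):

```shell
# Sketch: cap parallelism by running the scripts as an ECS service and
# adjusting its desired task count; ECS reconciles to this number.
aws ecs update-service \
  --cluster selenium-cluster \
  --service selenium-runner \
  --desired-count 50
```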

How to spin up all nodes in my EMR cluster before running my spark job

I have an EMR cluster that can scale up to a maximum of 10 SPOT nodes. When not being used it defaults to 1 CORE node (and 1 MASTER) to save costs obviously. So in total it can scale up to a maximum of 11 nodes 1 CORE + 10 SPOT.
When I run my spark job it takes a while to spin up the 10 SPOT nodes and my job ends up taking about 4hrs to complete.
I tried waiting until all the nodes were spun up, then canceled my job and immediately restarted it so that it can start using the max resources immediately, and my job took only around 3hrs to complete.
I have 2 questions:
1. Is there a way to make YARN spin up all the necessary resources before starting my job? I already specify the spark-submit parameters such as num-executors, executor-memory, executor-cores etc. during job submit.
2. I haven't done the cost analysis yet, but is it even worthwhile to do number 1 above? Does AWS charge for spin-up time, even when a job is not being run?
Would love to know your insights and suggestions.
Thank You
I am assuming you are using AWS managed scaling for this. If you switch to custom scaling you can set more aggressive scaling rules; you can also set the number of nodes to add or remove on each scale-out and scale-in, which will help you converge faster to the required number of nodes.
The only downside to custom scaling is that it takes 5 minutes to trigger.
Is there a way to make YARN spin up all the necessary resources before
starting my job?
I do not know how to achieve this, but in my opinion it is not worth doing. Spark is intelligent enough to do this for us.
It knows how to distribute tasks when instances come up or go away in the cluster. There is a particular Spark configuration you should be aware of to achieve this.
Set spark.dynamicAllocation.enabled to true. There are some other relevant configurations that you can change or leave as they are.
For more detail, refer to the documentation on spark.dynamicAllocation.enabled.
Please see the documentation for your Spark version; this link is for Spark 2.4.0.
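As a sketch, dynamic allocation can be requested directly on spark-submit (the class name, jar, and executor counts below are illustrative; the external shuffle service is required for dynamic allocation on YARN):

```shell
# Sketch: let Spark grow and shrink its executor count as nodes join or
# leave the cluster, instead of fixing --num-executors up front.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=40 \
  --conf spark.shuffle.service.enabled=true \
  --class com.example.MyJob my-job.jar
```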
Does AWS charge for spin up time, even when a job is not being run?
You get charged for every second of the instances that you use, with a one-minute minimum. Whether your job is running or not does not matter: even if instances sit idle in the cluster, you will pay for them.
Refer to these links for more detail:
EMR FAQ
EMR PRICING
Hope this gives you some idea about the EMR pricing and certain spark configuration related to the dynamic allocation.

Is there anyway I can use preemptible instance for dataflow jobs?

It's evident that preemptible instances are cheaper than non-preemptible ones. On a daily basis, 400-500 Dataflow jobs run in my organization's project, some of which are time-sensitive and others not. So is there any way I could use preemptible instances for the jobs without time constraints, which would lower the cost of the overall pipeline execution? Currently I'm running Dataflow jobs with the configuration specified below.
options.setTempLocation("gs://temp/");
options.setRunner(DataflowRunner.class);
options.setTemplateLocation("gs://temp-location/");
options.setWorkerMachineType("n1-standard-4");
options.setMaxNumWorkers(20);
options.setWorkerCacheMb(2000);
I'm not able to find out any pipeline options with preemptible instance setting.
Yes, it is possible to do so with Flexible Resource Scheduling in Cloud Dataflow (docs). Note that there are some things to consider:
Delayed execution: jobs are scheduled and not executed right away (you can see a new QUEUED status for your Dataflow jobs). They are run opportunistically when resources are available within a six-hour window. This makes FlexRS suitable to reduce cost for non-time-critical workloads. Also, be sure to validate your code before sending the job.
Batch jobs: as of now it only accepts batch jobs and requires autoscaling to be enabled:
You cannot set autoscalingAlgorithm=NONE
Dataflow Shuffle: it needs to be enabled. When so, no data is stored on persistent disks attached to the VMs. This way, when a preemption happens and resources are claimed back there is no need to redistribute the data.
Regions: following from the previous item, only regions where Dataflow Shuffle is supported can be selected. The list is here; turn-up for new regions will be announced in the release notes. As of now, the zone is automatically chosen within the region.
Machine types: FlexRS currently supports n1-standard-2 (default) and n1-highmem-16.
SDK: requires 2.12.0 or newer for Java or Python.
Quota: quota is reserved upfront (i.e. queued jobs also consume quota).
In order to run it, use --flexRSGoal=COST_OPTIMIZED and make sure the rest of your parameters conform to the FlexRS requirements.
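Continuing the options snippet from the question, the same goal can be set programmatically. This is a sketch based on the Dataflow runner's DataflowPipelineOptions as of Beam 2.12.0+; check the method and enum names against your SDK version:

```java
// Sketch: request FlexRS from the Java SDK instead of the command line.
// FlexResourceSchedulingGoal is nested in DataflowPipelineOptions.
options.setFlexRSGoal(
    DataflowPipelineOptions.FlexResourceSchedulingGoal.COST_OPTIMIZED);
// Equivalent command-line flag: --flexRSGoal=COST_OPTIMIZED
```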
A uniform discount rate is applied to FlexRS jobs; you can compare pricing details at the following link.
Note that you might see a Beta disclaimer in the non-English documentation but, as clarified in the release notes, it's Generally Available.

How to optimise aws cluster instance types in apache spark and drill aws cluster?

I am reading S3 buckets with Drill and writing the data back to S3 as Parquet in order to read it with Spark DataFrames for further analysis. AWS EMR requires me to have at least 2 core nodes.
Will using micro instances for the master and core nodes affect performance?
I don't make use of HDFS as such, so I am thinking of making them micro instances to save money.
All computation will be done in memory by R3.xlarge spot instances as task nodes anyway.
And finally, does Spark utilise multiple cores in each machine? Or is it better to launch a fleet of R3.xlarge task nodes on EMR 4.1 so they can be auto-resized?
I don't know how familiar you are with Spark, but there are a couple of things you need to know about core usage:
You can set the number of cores used by the driver process, but only in cluster mode. It's 1 by default.
You can also set the number of cores to use on each executor, for YARN and standalone mode only. It's 1 in YARN mode, and all the available cores on the worker in standalone mode. In standalone mode, setting this parameter allows an application to run multiple executors on the same worker, provided there are enough cores on that worker; otherwise, only one executor per application will run on each worker.
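Both knobs above map directly to spark-submit flags. A sketch for YARN cluster mode (the class name, jar, and counts are illustrative):

```shell
# Sketch: driver cores (cluster mode only) and per-executor cores on YARN.
spark-submit \
  --master yarn --deploy-mode cluster \
  --driver-cores 2 \
  --num-executors 6 \
  --executor-cores 4 \
  --class com.example.Analysis analysis.jar
```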
Now to answer both of your questions :
will using micro instances for master and cores affect performance?
Yes, the driver needs a minimum of resources to schedule jobs, collect data sometimes, etc. Performance-wise, you'll need to benchmark according to your use case to see what suits your usage better, which you can do using Ganglia, for example, on AWS.
does spark utilise multiple cores in each machine?
Yes, Spark uses multiple cores on each machine.
You can also read this concerning Which instance type is preferred for AWS EMR cluster for Spark.
Spark support on AWS is fairly new, but it's usually close to all other Spark cluster setups.
I advise you to read the "Plan EMR Instances" chapter of the AWS EMR developer guide along with the official Spark documentation.