Why won't my AWS Batch jobs all start and run in parallel? - amazon-web-services

Hope someone can help me stop tearing my hair out!
I have an array job with ~700 indexes.
When I submit the job, I get no more than 20-30 child jobs running simultaneously.
They all run eventually, which leads me to assume it's a constraint elsewhere, and as all the jobs are identical it's not permissions/roles/connectivity.
They are array/index jobs, so it's one job in the queue, and I can't find any limits on how many of these can run.
Note: I'm using EC2 (unmanaged) as the job was too big for Fargate.
I've tried:
double-checked they are parallel, not sequential
dropped individual CPU/memory for each job to 0.25 vCPU and 1 GB memory
created 'huge' compute environments with a max of 4096 vCPUs - no desired or min
added up to 3 compute environments to the queue (as per the limit)
What am I missing? Hope someone can point me in a different direction.
Thanks,
Ben

Based on the comments:
The issue was caused by EC2 service limits. AWS Batch uses EC2 to run the jobs, and it will not launch more resources than your EC2 limits allow. You can request an increase to the service quotas of your Amazon EC2 resources to overcome the issue.
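If you want to check or raise the relevant EC2 quota programmatically, a minimal sketch using boto3 and the Service Quotas API might look like the following. The quota code shown (L-1216C47A, the vCPU quota for running On-Demand Standard instances), the region, and the desired value are assumptions here; adjust them to whatever instance family your compute environment actually launches.

import boto3

# Service Quotas client (assumes credentials and region are already configured)
sq = boto3.client("service-quotas", region_name="us-east-1")

# Assumed quota: "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances"
quota = sq.get_service_quota(ServiceCode="ec2", QuotaCode="L-1216C47A")
print("Current vCPU limit:", quota["Quota"]["Value"])

# Ask AWS to raise the limit; the request still goes through the normal approval process
sq.request_service_quota_increase(
    ServiceCode="ec2",
    QuotaCode="L-1216C47A",
    DesiredValue=512.0,  # hypothetical target; pick what your ~700-index array actually needs
)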

Related

AWS Batch permits only 32 concurrent jobs in array configuration

I'm running some AI experiments that require multiple parallel runs in order to speed up the process.
I've built and pushed a container to ECR and I'm trying to run it with AWS Batch with an array size of 35, but only 32 jobs start immediately while the last three remain in the RUNNABLE state and don't start until another job has finished.
I'm running Fargate Spot for cost-saving reasons, with 1 vCPU and 8 GB RAM per job.
I looked at the documentation but there are no service quotas to increase regarding the array size (the max seems to be 10k), in Fargate, ECS, or AWS Batch.
What could be the cause?
My bad. The limit is actually imposed by the compute environment associated with the jobs.
I'm answering myself hoping to help somebody in the future.
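For anyone hitting the same wall: with 1 vCPU per child job, a cap of 32 concurrent jobs usually points at the compute environment's maximum vCPUs. A minimal boto3 sketch to raise it could look like this; the environment name and the new value are placeholders, not values from the question.

import boto3

batch = boto3.client("batch")

# Raise the ceiling so all 35 child jobs (1 vCPU each) can run at once.
# "my-fargate-ce" is a placeholder for your compute environment's name or ARN.
batch.update_compute_environment(
    computeEnvironment="my-fargate-ce",
    computeResources={"maxvCpus": 64},
)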

Limit concurrency of AWS ECS tasks

I have deployed a Selenium script on ECS Fargate which communicates with my server through an API. Normally almost 300 scripts run in parallel and bombard my server with API requests. I am facing a Net::Read::Timeout error because the server is unable to respond within the given time frame. How can I limit the number of ECS tasks running in parallel?
For example, if I have launched 300 scripts, 50 should run in parallel and the remaining 250 should stay in a pending state.
I think for your use case, you should have a look at AWS Batch, which supports Docker jobs, and job queues.
This question was about limiting concurrency on AWS Batch: AWS batch - how to limit number of concurrent jobs
Edit: btw, the same strategy could maybe be applied to ECS, as in assigning your scripts to only a few instances, so that more can't be provisioned until the previous ones have finished.
I am unclear how your script works and there may be many ways to peel this onion, but one way that would be easy to implement, assuming your tasks/scripts are long-running, is to create an ECS service and modify the number of tasks in it. You can start with a service that has 50 tasks and then update the service to 20 or 300 or any number you want. The service will deploy or remove tasks depending on the task count you configure.
This of course assumes the tasks (and the script) run indefinitely. If your script starts and ends at some point (in a batch sort of way), then launching them with either AWS Batch or Step Functions would probably be a better approach.
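If you stay on plain ECS, a rough sketch of the "cap it with a service" idea might look like this with boto3. The cluster and service names are placeholders, and it assumes your script loops forever so a fixed task count makes sense.

import boto3

ecs = boto3.client("ecs")

# Keep at most 50 copies of the Selenium task running; ECS will start or stop
# tasks as needed to converge on this number.
ecs.update_service(
    cluster="scraper-cluster",       # placeholder cluster name
    service="selenium-scripts",      # placeholder service name
    desiredCount=50,
)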

How to spin up all nodes in my EMR cluster before running my spark job

I have an EMR cluster that can scale up to a maximum of 10 SPOT nodes. When not being used it defaults to 1 CORE node (and 1 MASTER) to save costs, obviously. So in total it can scale up to a maximum of 11 nodes: 1 CORE + 10 SPOT.
When I run my Spark job it takes a while to spin up the 10 SPOT nodes, and my job ends up taking about 4 hours to complete.
I tried waiting until all the nodes were spun up, then cancelled my job and immediately restarted it so that it could use the maximum resources from the start, and it took only around 3 hours to complete.
I have 2 questions:
1. Is there a way to make YARN spin up all the necessary resources before starting my job? I already specify spark-submit parameters such as --num-executors, --executor-memory and --executor-cores when I submit the job.
2. I haven't done the cost analysis yet, but is it even worthwhile to do number 1? Does AWS charge for spin-up time, even when a job is not being run?
Would love to know your insights and suggestions.
Thank You
I am assuming you are using AWS managed scaling for this. If you can switch to custom automatic scaling you can set more aggressive scaling rules, and you can also set the number of nodes to scale up or down by on each evaluation, which will help you converge faster to the required number of nodes.
The only downside to custom scaling is that it takes about 5 minutes for a rule to trigger.
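As a sketch of what "more aggressive" custom scaling rules might look like with boto3 (assuming the cluster uses uniform instance groups): the cluster and instance group IDs, the threshold, and the scale-up step of 4 nodes are all assumptions, not values from the question.

import boto3

emr = boto3.client("emr")

emr.put_auto_scaling_policy(
    ClusterId="j-XXXXXXXXXXXXX",          # placeholder cluster ID
    InstanceGroupId="ig-XXXXXXXXXXXXX",   # placeholder TASK/CORE instance group ID
    AutoScalingPolicy={
        "Constraints": {"MinCapacity": 1, "MaxCapacity": 10},
        "Rules": [
            {
                "Name": "ScaleOutFast",
                "Description": "Add 4 nodes at a time when YARN memory is scarce",
                "Action": {
                    "SimpleScalingPolicyConfiguration": {
                        "AdjustmentType": "CHANGE_IN_CAPACITY",
                        "ScalingAdjustment": 4,   # bigger step = faster convergence
                        "CoolDown": 300,
                    }
                },
                "Trigger": {
                    "CloudWatchAlarmDefinition": {
                        "ComparisonOperator": "LESS_THAN",
                        "EvaluationPeriods": 1,
                        "MetricName": "YARNMemoryAvailablePercentage",
                        "Namespace": "AWS/ElasticMapReduce",
                        "Period": 300,
                        "Statistic": "AVERAGE",
                        "Threshold": 15.0,
                        "Unit": "PERCENT",
                    }
                },
            }
        ],
    },
)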
Is there a way to make YARN spin up all the necessary resources before starting my job?
I do not know how to achieve this, but in my opinion it is not worth doing. Spark is intelligent enough to handle this for us.
It knows how to redistribute tasks when instances come up or go away in the cluster. There is a particular Spark configuration you should be aware of to achieve this:
set spark.dynamicAllocation.enabled to true. There are some other related configurations that you can tune or leave as they are.
For more detail, refer to the documentation for spark.dynamicAllocation.enabled.
Please check the documentation for your Spark version; that link is for Spark 2.4.0.
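As a rough illustration, enabling dynamic allocation from a PySpark job might look like the snippet below. The min/max executor counts are just example values, and on EMR these properties are often already set in spark-defaults, so treat this as a sketch rather than something you must add.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dynamic-allocation-example")
    # Let Spark grow or shrink the executor count as nodes join or leave the cluster
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")    # example value
    .config("spark.dynamicAllocation.maxExecutors", "40")   # example value
    .config("spark.shuffle.service.enabled", "true")        # required for dynamic allocation on Spark 2.x
    .getOrCreate()
)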
Does AWS charge for spin up time, even when a job is not being run?
You get charged for every second of instance time that you use, with a one-minute minimum. It does not matter whether your job is running or not; even if the instances are idle in the cluster, you still pay for them.
Refer to these links for more detail:
EMR FAQ
EMR PRICING
Hope this gives you some idea of EMR pricing and of the Spark configuration related to dynamic allocation.

AWS Batch job takes too long to start with Min vCPUs=0 in the compute environment

I'm using AWS Batch. After submitting a job I wait 10-15 minutes until it gets RUNNING status. My compute environment configuration is as follows:
Provisioning model: EC2
Instance types: m4.xlarge
Min vCPUs: 0
Desired vCPUs: 0
Max vCPUs: 4
ECR image size ~130 MB.
I understand the problem is Min vCPUs = 0; it takes some time to start an ECS instance. But why so long?
To speed things up, I run a dummy job that runs for a long time to keep an ECS instance in the running state. After that, my jobs start quickly.
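For what it's worth, the "dummy job" workaround described above can be as simple as submitting a long sleep through boto3. The queue and job definition names here are placeholders for whatever you already have registered.

import boto3

batch = boto3.client("batch")

# Submit a job that just sleeps for an hour so the compute environment keeps
# at least one ECS instance alive; real jobs submitted meanwhile start quickly.
batch.submit_job(
    jobName="keep-warm",
    jobQueue="my-job-queue",            # placeholder queue name
    jobDefinition="my-job-definition",  # placeholder job definition
    containerOverrides={"command": ["sleep", "3600"]},
)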
I think this is a little late, but the following threads might help other people:
Here you can find an explanation of the time it takes to begin executing jobs and why Batch doesn't schedule jobs as quickly as expected:
https://forums.aws.amazon.com/thread.jspa?messageID=897734
Here you can find a thread with the issues encountered by a user and some solutions that were proposed:
https://www.reddit.com/r/aws/comments/amg7yk/is_there_an_opensource_alternative_to_aws_batch/
Here you can find a proposed way to configure the vCPU values:
https://forums.aws.amazon.com/thread.jspa?threadID=265573

AWS Batch EC2 Provision Time

I'm relatively new to using AWS Batch, and have been noticing it takes a LONG time to spin up EC2 instances in a managed compute environment.
My jobs will go from Submitted > Pending > Runnable within 1 minute.
But sometimes they will sit in Runnable anywhere from 15 minutes to 1 hour before an EC2 instance finally gets around to spinning up.
Any tips and tricks on getting AWS Batch to spin up instances more quickly?
Ideally I'd like an instance the moment something's in the RUNNABLE state.
For some more context, I am using AWS Batch essentially like Lambda, but where you choose your own instance type and disk. I can't use Lambda because the jobs need far more resources (GPUs) and time to process.
It would appear the scheduler takes its time based on non-transparent load at the data center.
I would love it if creating a Batch job returned an estimated TTL.
But anyway: sometimes I get machines instantly, sometimes it takes up to 15 minutes, and sometimes it takes an hour or more for newer GPU instance types, because none are available.
There doesn't appear to be any way to control the scheduler. Oh well.
Note: the settings below might help reduce provisioning time, but will incur additional costs.
Compute environments -> Compute resources -> Minimum vCPUs
Setting this to 1 (or more) keeps a single instance running at all times.
Compute environments -> Compute resources -> Allocation strategy
Changing this from BEST_FIT to BEST_FIT_PROGRESSIVE will also help.
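A minimal boto3 sketch of the Minimum vCPUs change is below. The environment name is a placeholder; keeping minvCpus above 0 means you pay for that instance even when it is idle, and depending on how the environment was created, the allocation strategy may only be changeable by recreating it.

import boto3

batch = boto3.client("batch")

# Keep one small instance warm so RUNNABLE jobs don't wait for EC2 provisioning.
batch.update_compute_environment(
    computeEnvironment="my-ec2-ce",   # placeholder compute environment name
    computeResources={"minvCpus": 1},
)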