AWS Batch to always launch a new EC2 instance for each job

I have set up a Batch environment with:
a managed compute environment
a job queue
job definitions
The actual job (a Docker container) does a lot of video encoding and hence uses up most of the CPU. The process itself takes a few minutes (close to 5 minutes to get all the encoders initialized). Ideally I would want one job per instance so that the encoders are not CPU-starved.
My issue is that when I launch multiple jobs at the same time, or close enough together, AWS Batch decides to launch both of them on the same instance, because the first container is still initializing and has not started using the CPUs yet.
It looks like a race condition to me, where both jobs see the newly created instance as available.
Is there a way I can launch one instance for each job without considering instances that are already running? Or any other solution to lock an instance once it has been designated for a particular job?
Thanks a lot for your help.

You shouldn't have to worry about separating the jobs onto different instances because the containers the jobs run in are limited in how many vCPUs they can use. For example, if you launch two jobs that each require 4 vCPUs, Batch might spin up an instance that has 8 vCPUs and run both jobs on the same instance. Each job will have access to only 4 of the vCPUs, so performance should be identical to a job running on its own with no other jobs on the instance.
However, if you still want to separate the jobs onto separate instances, you can do so by matching the vCPUs of the job to the instance type in the compute environment. For example, if you have a job that requires 4 vCPUs, you can configure your compute environment to only allow c5.xlarge instances, so each instance can run only one job. Note that if you want to run other jobs with higher vCPU requirements, you would have to run them in a different compute environment.
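As a rough boto3 sketch of that setup (the environment and job-definition names, subnet and security-group IDs, and container image below are placeholders, not values from the question):

```python
import boto3

batch = boto3.client("batch")

# Compute environment that only launches c5.xlarge (4 vCPUs), so a 4-vCPU job
# fills an entire instance and no second job can be packed onto it.
batch.create_compute_environment(
    computeEnvironmentName="one-job-per-instance",     # hypothetical name
    type="MANAGED",
    computeResources={
        "type": "EC2",
        "minvCpus": 0,
        "maxvCpus": 16,
        "instanceTypes": ["c5.xlarge"],                 # 4 vCPUs per instance
        "subnets": ["subnet-0123456789abcdef0"],        # hypothetical subnet
        "securityGroupIds": ["sg-0123456789abcdef0"],   # hypothetical security group
        "instanceRole": "ecsInstanceRole",
    },
)

# Job definition that requests all 4 vCPUs of a c5.xlarge.
batch.register_job_definition(
    jobDefinitionName="video-encode",   # hypothetical name
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/encoder:latest",  # hypothetical image
        "vcpus": 4,
        "memory": 7000,   # MiB; leaves headroom below what a c5.xlarge exposes to ECS
    },
)
```

With minvCpus set to 0, Batch scales the environment back to zero instances when no jobs are queued.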

Related

Increase and decrease AWS instance CPUs automatically

Is there a way in AWS to increase and decrease an instance's CPUs depending on load? I have been paying a lot of money to AWS because I have to resize the instance's cores by hand, even when no clients are using it.
To be more specific: clients can upload an Excel file and the software will do some calculations that take time depending on the instance's core count. With 2 cores the run takes about 30 minutes to complete, while with 96 cores it takes only a couple of minutes.
Is there a way to automatically increase the cores to 96 when clients are using the website and uploading files, and automatically decrease them to 2 when nothing is happening and clients are either not using the website or are only working with existing data without taking any new action?
If not, can I add a schedule in AWS to change the instance type? For example, run the instance on a 2-core type (e.g. t2.large), change it to 96 cores (e.g. c5a.24xlarge) only from 1pm to 6pm, and then switch it back to 2 cores?
I'm very new to AWS and DevOps in general. I have been reading about AWS Auto Scaling groups, but I'm not sure whether that is the answer to my problem.
No, it is not possible to "scale CPU cores" like that (commonly known as vertical scaling).
Instead, the recommended method is to add/remove parallel capacity based upon demand.
If you are using Amazon EC2, you can launch more instances or terminate existing ones. This can be automated through Amazon EC2 Auto Scaling, which can monitor metrics (e.g. CPU utilization) and then launch/terminate instances automatically. You would typically put a Load Balancer in front of these instances if they are web servers, or the instances might be 'worker nodes' that pull work from a queue.
If you are using containers (Docker, Kubernetes) then Amazon ECS/Amazon EKS can automatically add/remove tasks to meet demand for your application.
If you are using AWS Lambda functions, then they 'scale' by allowing multiple function invocations to run in parallel. Lambda functions typically exit when they have finished processing, so there is no charge when there is nothing to process.
These are all examples of Horizontal scaling, where capacity is added/removed in parallel.
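For the EC2 Auto Scaling route, a minimal boto3 sketch of a target-tracking policy (the Auto Scaling group name and the 60% target are assumptions for illustration):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Add/remove worker instances automatically, targeting ~60% average CPU across the group.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="excel-worker-asg",   # hypothetical Auto Scaling group
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 60.0,
    },
)
```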

AWS Batch permits only about 25 concurrent jobs in an array configuration while the compute environment allows 256 vCPUs

I am running an array job on AWS Batch using a Fargate Spot compute environment.
The main goal is to finish the work as quickly as possible, so when I run 100 jobs I expect all of them to run simultaneously.
But only about 25 of them start immediately; the rest of the jobs wait in RUNNABLE status.
The jobs run on a compute environment with a maximum of 256 vCPUs, and each job uses 1 vCPU or even less.
I haven't found any limits or quotas that could influence how the jobs are run.
What could be the cause?
I've talked with AWS Support and they advised me not to use Fargate when I need to process a lot of jobs as quickly as possible.
For large-scale job processing, an On-Demand (EC2) compute environment is recommended.
After changing the provisioning model to On-Demand, the number of concurrent jobs grew up to the vCPU limit configured in the compute environment, which was exactly what I needed.
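For reference, the array job itself is a single call in boto3. A minimal sketch, assuming a hypothetical queue backed by an EC2 On-Demand compute environment and a hypothetical job definition:

```python
import boto3

batch = boto3.client("batch")

# Submit a 100-wide array job; each child job receives AWS_BATCH_JOB_ARRAY_INDEX.
response = batch.submit_job(
    jobName="parallel-work",
    jobQueue="ec2-ondemand-queue",   # hypothetical queue on an EC2 On-Demand compute environment
    jobDefinition="worker-job:1",    # hypothetical job definition
    arrayProperties={"size": 100},
)
print(response["jobId"])
```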

AWS Batch EC2 Provision Time

I'm relatively new to using AWS Batch, and have been noticing it takes a LONG time to spin up EC2 instances in a managed compute environment.
My jobs will go from Submitted > Pending > Runnable within 1 minute.
But sometimes they will sit in Runnable anywhere from 15 minutes to 1 hour before an EC2 instance finally gets around to spinning up.
Any tips and tricks on getting AWS Batch to spin up instances more quickly?
Ideally I'd like an instance the moment something enters the Runnable state.
For some more context, I am using AWS Batch essentially like Lambda, except that you choose your own instance type and disk. I can't use Lambda because the jobs need far more resources (GPUs) and time to process.
It would appear the scheduler takes its time based on non-transparent load at the data center.
I would love it if creating a Batch job returned an estimated TTL.
Anyway, sometimes I get machines instantly, sometimes it takes up to 15 minutes, and sometimes it takes an hour or more for newer GPU instance types, because none are available.
There doesn't appear to be any way to control the scheduler. Oh well.
Note: the settings below might help reduce provisioning time, but they will incur additional costs.
Compute environments -> Compute resources -> Minimum vCPUs
Setting this to 1 (or more) keeps a single instance running at all times (see the sketch after this list).
Compute environments -> Compute resources -> Allocation strategy
Changing this from BEST_FIT to BEST_FIT_PROGRESSIVE will also help.
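If you manage the compute environment through the API, the Minimum vCPUs change is a one-call update. A minimal boto3 sketch, assuming a hypothetical compute environment name:

```python
import boto3

batch = boto3.client("batch")

# Keep some capacity warm so Runnable jobs don't wait for a cold instance launch.
# Note: minvCpus > 0 means you pay for that capacity even when no jobs are running.
batch.update_compute_environment(
    computeEnvironment="my-compute-env",   # hypothetical compute environment name
    computeResources={"minvCpus": 4},      # roughly one always-on c5.xlarge worth of vCPUs
)
```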

Does every AWS Batch job spin up a new Docker container

Every time I submit a Batch job, does a new Docker container get created, or is the old container reused?
If a new Docker container is created every time, what happens to that container when the job is done?
In Amazon ECS, the ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION variable sets how long to wait after a task is stopped before its Docker container is removed (3 hours by default).
If all these containers only get cleaned up after three hours, wouldn't the ECS container instance fill up quickly if I submit a lot of jobs?
I'm getting the error CannotCreateContainerError: API error (500): devmapper when running a Batch job. Would it help if I cleaned up the Docker container files at the end of the job?
Every time I submit a Batch job, does a new Docker container get created, or is the old container reused?
Yes. Each job run on Batch will be run as a new ECS Task, meaning a new container for each job.
If all these containers only get cleaned up after three hours, wouldn't the ECS container instance fill up quickly if I submit a lot of jobs?
This all depends on your workloads: the length of the jobs, their disk usage, and so on. With large quantities of short jobs that consume disk, it is entirely possible.
CannotCreateContainerError: API error (500): devmapper
The documentation for this error indicates a few possible solutions; however, the first one, which you've already called out, may not help in this case.
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION which defaults to 3h on ECS, seems to be set to 2m by default on Batch Clusters - you can inspect the EC2 User Data on one of your batch instances to validate that it is set this way on your clusters. Depending on the age of the cluster, these settings may change. Batch does not automatically update to the latest ECS Optimized AMI without creation of a whole new cluster, so I would not be surprised if it does not change settings either.
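If you want to verify what your own instances were configured with, one way (a rough boto3 sketch; the instance ID is a placeholder) is to decode the user data of a Batch-managed instance and look for ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION:

```python
import base64
import boto3

ec2 = boto3.client("ec2")

# Fetch and decode the user data of one of the Batch-managed instances to see
# what ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION the ECS agent was configured with.
resp = ec2.describe_instance_attribute(
    InstanceId="i-0123456789abcdef0",   # hypothetical instance ID from your Batch cluster
    Attribute="userData",
)
print(base64.b64decode(resp["UserData"]["Value"]).decode())
```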
If your cleanup duration setting is already set low, you might try creating a custom AMI that provisions a larger-than-normal Docker volume. By default, the ECS optimized AMIs ship with an 8 GB root volume and a 22 GB volume for Docker.

AWS ECS running a task that requires many cores

I am conceptually trying to understand how to use AWS ECS to run my "cluster" jobs.
I have some scientific software inside a Docker container, that natively takes advantage of as many cores as the underlying instance has to offer.
My question in this case is, can I use AWS ECS to "increase" the number of "visible" cores to the task running inside my Docker container. For instance, is my "cluster" limited to only a single instance? Or is a "cluster" expandable to multiple instances?
I haven't been able to find any answers by looking through the AWS docs.
A cluster is just a group of EC2 instances that are ECS-enabled (running special agent software) and grouped together. Tasks that you run on this cluster are spread across these instances. Each task can involve multiple containers. However, each container stays within its instance's boundaries, hardware-wise: it is allocated a number of "CPU units" and shares them with the other containers running on the same instance.
From my understanding, running a process that spans multiple cores in a single container does not quite fit the idea of the ECS architecture; it feels like trying to do part of the ECS scheduler's job.
I found these resources useful when I was reading about it:
My notes on Amazon's ECS post by Jérôme Petazzoni
Application Architecture in ECS docs
Task Definition Parameters in ECS docs
I had a similar situation moving a Python app that used a script to spawn copies of itself based on the number of cores. The answer to this isn't so much an ECS problem as it is a Docker best practice: you should strive to use one process per container (see https://docs.docker.com/engine/userguide/eng-image/dockerfile_best-practices/).
How I ended up implementing this was using a Dockerfile to run each process, and then using essential ECS tasks so the task reloads itself if it dies.
Your cluster is a collection of EC2 instances with the ECS agent running. Each instance has a certain number of CPU 'units' (typically 1024 units = 1 core) and RAM. I profiled my app at peak load and tweaked the mix until I got it where I liked it. If your app can use more CPU than that, try giving it 2048 CPU units or some other amount and see how it performs. I used Meros (https://meros.io/) to profile my app.
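To make the CPU-units mechanics concrete, here is a minimal boto3 sketch of registering an ECS task definition that reserves 2048 units (about 2 cores); the task family and container image are placeholder assumptions, not values from the question:

```python
import boto3

ecs = boto3.client("ecs")

# Register a task definition whose container reserves 2048 CPU units (~2 cores)
# and 4 GiB of memory on the EC2 launch type.
ecs.register_task_definition(
    family="my-cpu-heavy-app",   # hypothetical task family
    containerDefinitions=[
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",  # hypothetical image
            "cpu": 2048,      # 1024 CPU units == 1 vCPU
            "memory": 4096,   # MiB
            "essential": True,
        }
    ],
)
```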
Hope this helps!
"increase" the number of "visible" cores to the task running inside my Docker container
A container and a cluster are different things. You may run a lot of containers on one instance, but you can't run one container across multiple instances.
A cluster is a set of container instances (EC2 instances registered with ECS) on which your containers run.
is my "cluster" limited to only a single instance?
No, you can choose the number of instances in the cluster.