AWS ECS unable to run more than 10 tasks - amazon-web-services

I have an ECS Cluster with say 20 registered instances.
I have 3 task definitions to solve a big data problem.
Task 1: Split Task - This starts a docker container and the container definition has an entrypoint to run a script called HPC-Split. This script splits the big data into say 5 parts in a mounted EFS.
The number of tasks (count) for this task is 1.
Task 2: Run Task: This starts another docker container and this docker container has an entrypoint to run a script called HPC-script which processes each split part. The number of tasks selected for this is 5, so that this is processed in parallel.
Task 3: Merge Task: This starts a third docker container which has an entrypoint to run a script called HPC-Merge and this merges the different outputs from all the parts. Again, the number of tasks (count) that we need to run for this is 1.
Now AWS service limits say: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service_limits.html
The maximum tasks (count) we can run is 10. So we are at the moment able to run only 10 processes in parallel.
Meaning, Split the file (1 task runs on one instance), Run the process (task runs on 10 instances), Merge the file (task runs on 1 instance.)
The limit of 10 limits the level at which we can parallelize our processing, and I don't know how to get around it. I am surprised by this limit because there is surely a need to run long-running processes on more than 10 instances in the cluster.
Can you guys please give me some pointers on how to get around this limit, or how to use ECS optimally to run, say, 20 tasks in parallel?
The spread placement I use is 'One task per host' because the process uses all cores in one host.
How can I architect this better with ECS?

Number of tasks launched (count) per run-task
This is the maximum number of tasks that can be launched per invocation of the run-task API. To launch more tasks, call the run-task API again.
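Since the limit is per invocation rather than per cluster, one way to read that doc entry is: just call run-task in batches. A minimal sketch, assuming a boto3-style ECS client; the cluster and task definition names would be your own:

```python
# The ECS run-task API launches at most 10 tasks per call, so split a larger
# count into batches of <= 10 and invoke run-task once per batch.
RUN_TASK_MAX = 10  # per-invocation limit from the ECS service limits page

def batch_counts(total, batch_size=RUN_TASK_MAX):
    """Split `total` tasks into per-call counts no larger than `batch_size`."""
    counts = []
    remaining = total
    while remaining > 0:
        counts.append(min(batch_size, remaining))
        remaining -= counts[-1]
    return counts

def run_tasks(ecs_client, cluster, task_definition, total):
    """Call run-task repeatedly until `total` tasks have been requested."""
    for count in batch_counts(total):
        ecs_client.run_task(
            cluster=cluster,
            taskDefinition=task_definition,
            count=count,
        )
```

So 20 parallel workers would be two run-task calls of 10, subject to the cluster actually having placement capacity for all of them.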

If your tasks that do the split work are architected to wait until such work is available somehow (with a queue system of some kind or whatever), I would launch them as a service and simply change the 'Desired Tasks' number from zero to 20 as needed.
When you need the workers, scale the service up to 20 Desired Tasks. Then launch your task to split the work and launch the task that waits for the work to be done. When the workers are all done, you can scale them back down to zero.
This also seems like work better suited for Fargate unless you have extreme memory or disk size needs. Otherwise you'll likely want to pair this with scaling up the EC2-based Cluster as needed and back down when not.
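The scale-up/scale-down flow above can be sketched in a few lines, assuming a boto3-style ECS client; "hpc-cluster" and "hpc-workers" are placeholder names:

```python
def scale_workers(ecs_client, cluster, service, desired_count):
    """Set the service's Desired Tasks; ECS starts or drains tasks to match."""
    ecs_client.update_service(
        cluster=cluster,
        service=service,
        desiredCount=desired_count,
    )

# Typical flow for one processing run:
#   scale_workers(ecs, "hpc-cluster", "hpc-workers", 20)  # bring workers up
#   ... run the split task, let the workers drain the queue ...
#   scale_workers(ecs, "hpc-cluster", "hpc-workers", 0)   # scale back to zero
```

Because the service scheduler maintains the desired count for you, this sidesteps the per-call run-task limit entirely.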

Related

AWS Batch permits only 32 concurrent jobs in array configuration

I'm running some AI experiments that require multiple parallel runs in order to speed up the process.
I've built and pushed a container to ECR and I'm trying to run it with AWS Batch with an array size of 35. But only 32 start immediately, while the last three jobs remain in the RUNNABLE state and don't start until one job has finished.
I'm running Fargate Spot for cost-saving reasons with 1 vcpu and 8GB RAM.
I looked at the documentation, but there is no service quota to increase regarding array size (the maximum seems to be 10k) in Fargate, ECS, or AWS Batch.
What could be the cause?
My bad. The limit is actually imposed in the Compute Environment associated with the jobs.
I answered myself hoping to help somebody in the future.

Limit concurrency of AWS ECS tasks

I have deployed a Selenium script on ECS Fargate which communicates with my server through an API. Normally almost 300 scripts run in parallel and bombard my server with API requests. I am facing a Net::Read::Timeout error because the server is unable to respond within the given time frame. How can I limit the number of ECS tasks running in parallel?
For example, if I have run 300 scripts, 50 scripts should run in parallel and the remaining 250 should be in a pending state.
I think for your use case, you should have a look at AWS Batch, which supports Docker jobs, and job queues.
This question was about limiting concurrency on AWS Batch: AWS batch - how to limit number of concurrent jobs
Edit: btw, the same strategy could maybe be applied to ECS, as in assigning your scripts to only a few instances, so that more can't be provisioned until the previous ones have finished.
I am unclear how your script works, and there may be many ways to peel this onion, but one way that would be easy to implement, assuming your tasks/scripts are long-running, is to create an ECS service and modify the number of tasks in it. You can start with a service that has 50 tasks and then update the service to 20 or 300 or any number you want. The service will deploy or remove tasks depending on the task count parameter you configure.
This of course assumes the tasks (and the script) run infinitely. If your script is such that it starts and it ends at some point (in a batch sort of way) then probably launching them with either AWS Batch or Step Functions would be a better approach.
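To make the "50 running, 250 pending" behavior concrete: whichever service you pick (AWS Batch, or an SQS-fed ECS service), the queue approach boils down to a fixed pool of workers pulling jobs while the rest wait. A minimal in-process sketch, with the real queue swapped for a local one purely for illustration:

```python
import queue
import threading

def run_with_limit(jobs, max_parallel):
    """Run `jobs` (callables) with at most `max_parallel` executing at once;
    the rest wait in the queue, mirroring a fixed-size worker pool fed by a
    job queue such as SQS or an AWS Batch job queue."""
    work = queue.Queue()
    for job in jobs:
        work.put(job)

    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = work.get_nowait()
            except queue.Empty:
                return  # queue drained; this worker slot is done
            result = job()
            with lock:
                results.append(result)

    threads = [threading.Thread(target=worker) for _ in range(max_parallel)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# e.g. 300 "scripts" but only 50 worker slots:
#   outputs = run_with_limit([lambda i=i: process(i) for i in range(300)], 50)
```

The key point is that concurrency is bounded by the worker count, not by how many jobs you submit, which is exactly what the job-queue services give you.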

AWS batch to always launch new ec2 instance for each job

I have setup a batch environment with
Managed Compute environment
Job Queue
Job Definitions
The actual job(docker container) does a lot of video encoding and hence uses up most of the CPU. The process itself takes a few minutes (close to 5 minutes to get all the encoders initialized). Ideally I would want one job per instance so that the encoders are not CPU starved.
My issue is that when I launch multiple jobs at the same time, or close enough together, AWS Batch decides to launch both of them on the same instance, as the first container is still initializing and has not started using CPUs yet.
It seems like a race condition to me where both jobs see the instance created as available.
Is there a way I can launch one instance for each job without looking for instances that are already running? Or any other solution to lock an instance once it is designated for a particular job?
Thanks a lot for your help.
You shouldn't have to worry about separating the jobs onto different instances because the containers the jobs run in are limited in how many vCPUs they can use. For example, if you launch two jobs that each require 4 vCPUs, Batch might spin up an instance that has 8 vCPUs and run both jobs on the same instance. Each job will have access to only 4 of the vCPUs, so performance should be identical to a job running on its own with no other jobs on the instance.
However, if you still want to separate the jobs onto separate instances, you can do so by matching the vCPUs of the job with the instance type in the compute environment. For example, if you have a job that requires 4 vCPUs, you can configure your compute environment to only allow c5.xlarge instances, so each instance can run only one job. However, if you want to run other jobs with higher vCPU requirements, you would have to run them in a different compute environment.
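As a sketch of that second approach, the relevant part of a managed compute environment pinned to c5.xlarge (4 vCPUs), so that one 4-vCPU job fills a whole instance; the name, subnet, and role values here are placeholders:

```json
{
  "computeEnvironmentName": "one-job-per-instance",
  "type": "MANAGED",
  "computeResources": {
    "type": "EC2",
    "instanceTypes": ["c5.xlarge"],
    "minvCpus": 0,
    "maxvCpus": 64,
    "subnets": ["subnet-placeholder"],
    "instanceRole": "ecsInstanceRole-placeholder"
  }
}
```

With the instance type fixed, Batch cannot pack a second 4-vCPU job onto an instance, so each job gets its own machine.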

Updating an AWS ECS Service

I have a service running on AWS EC2 Container Service (ECS). My setup is a relatively simple one. It operates with a single task definition and the following details:
Desired capacity set at 2
Minimum healthy set at 50%
Maximum available set at 200%
Tasks run with 80% CPU and memory reservations
Initially, I am able to get the necessary EC2 instances registered to the cluster that holds the service without a problem. The associated task then starts running on the two instances. As expected – given the CPU and memory reservations – the tasks take up almost the entirety of the EC2 instances' resources.
Sometimes, I want the task to use a new version of the application it is running. In order to make this happen, I create a revision of the task, de-register the previous revision, and then update the service. Note that I have set the minimum healthy percentage to require 2 * 0.50 = 1 instance running at all times and the maximum healthy percentage to permit up to 2 * 2.00 = 4 instances running.
Accordingly, I expected 1 of the de-registered task instances to be drained and taken offline so that 1 instance of the new revision of the task could be brought online. Then the process would repeat itself, bringing the deployment to a successful state.
Unfortunately, the cluster does nothing. In the events log, it tells me that it cannot place the new tasks, even though the process I have described above would permit it to do so.
How can I get the cluster to perform the behavior that I am expecting? I have only been able to get it to do so when I manually register another EC2 instance to the cluster and then tear it down after the update is complete (which is not desirable).
I have faced the same issue, where the tasks get stuck with no space to place them. The snippet below from the AWS doc on updating a service helped me make a decision:
If your service has a desired number of four tasks and a maximum percent value of 200%, the scheduler may start four new tasks before stopping the four older tasks (provided that the cluster resources required to do this are available). The default value for maximum percent is 200%.
We need cluster resources / container instances available so the new tasks can start and the older ones can drain.
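For the setup in the question (desired 2, minimum healthy 50%, maximum 200%), the deployment bounds work out like this; rounding directions here follow my reading of the ECS docs (minimum rounded up, maximum rounded down):

```python
import math

def deployment_bounds(desired, min_healthy_pct, max_pct):
    """Lower/upper task-count limits ECS enforces during a rolling update.
    Minimum healthy count is rounded up, maximum count rounded down."""
    lower = math.ceil(desired * min_healthy_pct / 100)
    upper = math.floor(desired * max_pct / 100)
    return lower, upper

# desired=2, min=50%, max=200% -> must keep >= 1 task, may run <= 4 tasks.
# But with each task reserving 80% of an instance's CPU/memory, a 2-instance
# cluster has no room to place a third task, so the deployment stalls until
# extra capacity is added.
```

This is why the scheduler's generous 200% ceiling is irrelevant when the instances themselves are already nearly full.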
These are the things I do:
Before doing a service update, add about 20% capacity to your cluster. You can use the ASG (Auto Scaling group) command line to add 20% to your cluster's desired capacity. This way you will have some additional instances during deployment.
Once you have the instances, the new tasks will start spinning up quickly and the older ones will start draining.
But does this mean I will have extra container instances?
Yes, during the deployment you will add some instances, and as the older tasks drain those extra instances will hang around. The way to remove them is:
Create a MemoryReservationLow alarm (~70% threshold in your case) for around 25 minutes (a longer duration to be sure that we have over-provisioned). Once the reservation drops because those extra servers are no longer being used, they can be removed.
I have seen this before. If your port mapping is attempting to map a static host port to the container within the task, you need more cluster instances.
Also this could be because there is not enough available memory to meet the memory (soft or hard) limit requested by the container within the task.

Elastic beanstalk periodic tasks on autoscaled environment

On an autoscaled environment running a periodic task, if the environment is scaled up, do the periodic tasks get run on each instance? Or more specifically, does each instance then post to the queue leading to multiple "periodic tasks" running?
Yes. If there's some periodic task that should only be triggered once, you should have a separate autoscaled environment with a minimum and maximum of one instance, to either perform the task or trigger it on one of your servers (for example, make a request to your load balancer and one of your instances will perform the task).
Yes, behind the scenes it's just a cron job on all your instances. The default scenario for periodic tasks is that the worker nodes read the tasks from the SQS queue.
So yes, if you are doing some kind of posting that has to happen only once, then you either need to put some logic in between or use a different solution.
(For example, generate some kind of time-based ID which identifies the cycle of the cron job. Messages from the same cycle then have the same ID, so it's easy to filter them and ignore everything after the first.)
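One hedged way to implement that cycle ID: derive it from the cron period, so every instance firing within the same cycle computes the same ID, and duplicates can be filtered. A minimal sketch (the in-memory `seen` set is a stand-in for whatever shared store you'd actually use):

```python
import time

def cycle_id(period_seconds, now=None):
    """ID shared by all instances whose cron fires within the same cycle."""
    if now is None:
        now = time.time()
    return int(now // period_seconds)

seen = set()

def handle_periodic_task(period_seconds, now=None):
    """Process the task only once per cycle; later duplicates are ignored."""
    cid = cycle_id(period_seconds, now)
    if cid in seen:    # in production this check would live in a shared store
        return False   # (e.g. a DynamoDB conditional put), not local memory
    seen.add(cid)
    return True
```

For an hourly job, two instances firing at 12:00:05 and 12:00:12 both land in the same cycle bucket, so only the first one does the work.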