Limit concurrency of AWS ECS tasks

I have deployed a Selenium script on ECS Fargate which communicates with my server through an API. Typically, almost 300 scripts run in parallel and bombard my server with API requests. I am getting a Net::Read::Timeout error because the server is unable to respond within the given time frame. How can I limit the number of ECS tasks running in parallel?
For example, if I launch 300 scripts, 50 should run in parallel and the remaining 250 should stay in a pending state.

I think for your use case you should have a look at AWS Batch, which supports Docker jobs and job queues.
This question was about limiting concurrency on AWS Batch: AWS batch - how to limit number of concurrent jobs
Edit: by the way, the same strategy could perhaps be applied to ECS, i.e. assigning your scripts to only a few instances, so that more can't be provisioned until the previous ones have finished.
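With Batch, one way to cap parallelism is to make the compute environment small enough that only 50 one-vCPU jobs fit at a time; everything else stays RUNNABLE in the queue. A minimal, untested boto3 sketch of that idea follows (the environment and queue names, subnets and security groups are placeholders, and maxvCpus=50 assumes each script needs 1 vCPU):

    import boto3

    batch = boto3.client("batch")

    batch.create_compute_environment(
        computeEnvironmentName="selenium-ce",      # hypothetical name
        type="MANAGED",
        computeResources={
            "type": "FARGATE",
            "maxvCpus": 50,                        # hard cap on concurrently used vCPUs
            "subnets": ["subnet-xxxxxxxx"],        # placeholder
            "securityGroupIds": ["sg-xxxxxxxx"],   # placeholder
        },
    )

    batch.create_job_queue(
        jobQueueName="selenium-queue",             # hypothetical name
        state="ENABLED",
        priority=1,
        computeEnvironmentOrder=[{"order": 1, "computeEnvironment": "selenium-ce"}],
    )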

I am unclear how your script works and there may be many ways to peel this onion, but one approach that would be easy to implement, assuming your tasks/scripts are long running, is to create an ECS service and modify the number of tasks in it. You can start with a service that has 50 tasks and then update the service to 20 or 300 or any number you want. The service will deploy or remove tasks to match the desired count you configure.
This of course assumes the tasks (and the script) run indefinitely. If your script is such that it starts and ends at some point (in a batch sort of way), then launching them with either AWS Batch or Step Functions would probably be a better approach.
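For what it's worth, a minimal boto3 sketch of adjusting the desired count (the cluster and service names are placeholders):

    import boto3

    ecs = boto3.client("ecs")

    # ECS starts or stops tasks until the service converges on this count.
    ecs.update_service(
        cluster="my-cluster",          # placeholder cluster name
        service="selenium-service",    # placeholder service name
        desiredCount=50,
    )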

Related

AWS Batch permits approx. 25 concurrent jobs in array configuration while the compute environment allows 256 CPUs

I am running a job array on AWS Batch using a Fargate Spot compute environment.
The main goal is to do some work as quickly as possible. So when I run 100 jobs, I expect all of them to run simultaneously.
But only about 25 of them start immediately; the rest of the jobs wait in RUNNABLE status.
The jobs run on a compute environment with a maximum of 256 CPUs. Each job uses 1 CPU or even less.
I haven't found any limits or quotas that can influence the process of running jobs.
What could be the cause?
I've talked with AWS Support and they advised me not to use Fargate when I need to process a lot of jobs as quickly as possible.
For large-scale job processing, an On-Demand solution is recommended.
So, after changing the provisioning model to On-Demand, the number of concurrent jobs grew up to the CPU limit configured in the settings, which was what I needed.

Auto scaling service in AWS without duplicating cron jobs

I have a service (a Golang web server) running on AWS on an EC2 instance (no auto scaling). This service has a few cron jobs that run throughout the day, and these jobs start when the service starts.
I would like to take advantage of auto scaling in some form on AWS. I have been looking at ECS and Beanstalk.
When I add auto scaling, I need each cron job to execute on only one of the scaled instances because of rate limits on external APIs. Right now the cron jobs are tightly coupled to the service, and I am looking for an option that does not require moving them into their own service.
How can I achieve this in a good way using AWS?
You're going to run into this as a general issue in any scalable application where crons cannot or should not run multiple times; it's not really AWS-specific. I'm not sure to what extent you want to keep things coupled or how your crons are currently run, but here are a few suggestions that might work for you:
Create a "cron runner" instance with a limit to run crons on
You could create a separate ECS service which has no autoscaling and a fixed value of 1 instance. This instance would run the same copy of your code as your "normal" instances and would run crons. You would turn crons off on your "normal" instances. You might find that this can be a very small instance since it doesn't handle any web traffic.
Create a "cron trigger" instance which fires off crons remotely
Here you create one "trigger" instance which sends a request to your normal instances through an ALB. Because your ALB will route the request to 1 of the servers behind it the cron only gets run once. One watch out with this is that if your cron is long running, you may need to consider your request timeouts. You'll also have to think about retries etc but I assume you already have a process that can be adapted for that.
The above solutions can be adapted with message queues etc but the basis of both is that there is another instance of some kind which starts the cron and is separate from your normal servers. Depending on when your cron runs, you may only need to run this cron instance for a few hours per day so it can be cost efficient to do things like this.
Personally I have used both methods in a multi-tenant application and I had to go with the option of running the cron like this due to the number of tenants and the time / resource it took to run the crons for all of them at once:
A CloudWatch schedule triggers a Lambda, which sends a message to SQS to queue a cron run for each tenant individually.
Cron servers (totally separate from the main web servers but running the same/similar code) pull messages and run the cron for each tenant individually. Each worker stores a key in Redis for crons that must only run once, guarding against SQS's "at least once" delivery so crons don't run twice.
SQS also helps handle failures, with retry policies and dead-letter queues.
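A rough, untested sketch of that worker loop, assuming redis-py and messages shaped like {"job": ..., "tenant_id": ..., "date": ...} (the queue URL, Redis host, lock TTL and message fields are all placeholders):

    import json
    import boto3
    import redis

    sqs = boto3.client("sqs")
    r = redis.Redis(host="my-redis-host")  # placeholder host
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/cron-jobs"  # placeholder

    def run_cron_for_tenant(tenant_id):
        ...  # the actual cron work for one tenant

    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            body = json.loads(msg["Body"])
            # SETNX-style lock: only the first worker to claim this job/tenant/date
            # actually runs it, protecting against at-least-once delivery.
            lock_key = f"cron:{body['job']}:{body['tenant_id']}:{body['date']}"
            if r.set(lock_key, "1", nx=True, ex=6 * 3600):
                run_cron_for_tenant(body["tenant_id"])
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])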
Ultimately you need to kick off these crons from one place. If possible, change up your crons so it doesn't matter if they run twice. It makes it easier to deal with retries and things like that.

Has anyone implemented a Django Celery worker as a Docker container that only runs when a task is assigned?

I am able to successfully deploy a Django Celery worker as a Docker container in an AWS ECS service using Fargate as the compute type.
But my concern is that the Celery container runs 24/7. If I could run the container only when a task is assigned, I could save a lot of money given the AWS Fargate billing model.
Celery isn't really the right thing to use because it's designed to persist, but the goal should be reasonably easy to achieve.
Architecturally, you probably want to run a script on a Fargate task. The script chews through the queue and then dies. You'd trigger that task somehow:
An API call from your data receiver (e.g. Django)
A Lambda function (triggered by what?)
There are still some open questions: do you limit yourself to one task at a time, or do you need to manage concurrent requests to the queue? Do you retry? But it's a plausible place to start.
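If the work queue were SQS (the original setup uses a Celery broker, so this is only illustrative), the drain-and-exit script could look roughly like this; the queue URL and the process() body are placeholders:

    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # placeholder

    def process(body):
        ...  # your existing job logic

    # Chew through the queue, then exit so the Fargate task stops (and stops billing).
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=5)
        messages = resp.get("Messages", [])
        if not messages:
            break
        for msg in messages:
            process(msg["Body"])
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])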
A not-recommended but perhaps easier way to do it would be to run a celery worker in your Django container (e.g. using supervisor) and use Fargate's autoscaling features. You'd always have the one Django container running to receive data. If the celery worker on that container used up all of the available resources, Fargate would scale the service by adding tasks. Once the jobs were done, it'd remove the excess containers. You'd be paying the "overhead" for Django in each container, but it could cost you less than an always-on celery container and would certainly be simpler -- leverage your celery experience and avoid the extra layer of event handling.
EDIT: Another disadvantage of this version is that you need to run Redis somewhere and I've found the minimum cost for this to be relatively high.
Based on my growing AWS experience, here's what you probably should do...
Use AWS API Gateway as an always-on receiver of events/requests. You only pay for requests, the free tier includes a million per month, and the next 300M are $1 (pricing) so this is likely to be free.
While you have many options for responding to the request, an AWS Lambda function (which can be written in python) should have the least overhead.
If your queue will run longer than a Lambda function allows (15 minutes), you'll need to have that Lambda function delegate the processing to e.g. a Fargate task.
(Optional) If you want to use a Docker Hub container for your Fargate task, we experienced a bunch of issues with Tasks and Services failing to start due to rate limits at Docker Hub. We ended up wrapping our Fargate task in a Step Function that checked for this error specifically and retried.
(Optional) If you need to limit concurrency, this SO answer suggests having your Lambda function check for an existing execution (of a Step Function or Fargate task). I was hoping there was something native on Fargate Tasks or Step Functions but I don't see anything.
I imagine this would represent a huge operating cost savings over the always-on Fargate task and ElastiCache Redis queue, but the up-front cost/hassle could exceed the savings.
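For step 2 above, a minimal (untested) Lambda handler that hands the work off to a Fargate task might look like this; the cluster, task definition, subnets and security groups are placeholders:

    import boto3

    ecs = boto3.client("ecs")

    def handler(event, context):
        # Kick off one Fargate task to work the queue; API Gateway returns immediately.
        ecs.run_task(
            cluster="my-cluster",                  # placeholder
            taskDefinition="queue-worker",         # placeholder task definition
            launchType="FARGATE",
            count=1,
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": ["subnet-xxxxxxxx"],       # placeholders
                    "securityGroups": ["sg-xxxxxxxx"],
                    "assignPublicIp": "ENABLED",
                }
            },
        )
        return {"statusCode": 202}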
Have you thought of using AWS Lambda instead of the celery worker? You would then pay per task execution, where cost is driven by execution time and memory usage. If you have an application which is mostly idle then paying per request, skipping the idle cost, would make the most sense.

AWS ECS unable to run more than 10 tasks

I have an ECS Cluster with say 20 registered instances.
I have 3 task definitions to solve a big data problem.
Task 1: Split Task - This starts a docker container and the container definition has an entrypoint to run a script called HPC-Split. This script splits the big data into say 5 parts in a mounted EFS.
The number of tasks (count) for this task is 1.
Task 2: Run Task: This starts another Docker container whose entrypoint runs a script called HPC-script, which processes each split part. The number of tasks selected for this is 5, so that the parts are processed in parallel.
Task 3: Merge Task: This starts a third docker container which has an entrypoint to run a script called HPC-Merge and this merges the different outputs from all the parts. Again, the number of tasks (count) that we need to run for this is 1.
Now AWS service limits say: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service_limits.html
The maximum tasks (count) we can run is 10. So we are at the moment able to run only 10 processes in parallel.
Meaning: split the file (1 task runs on one instance), run the process (task runs on 10 instances), merge the file (task runs on 1 instance).
The limit of 10 limits the level at which we can parallelize our processing, and I don't know how to get around it. I am surprised by this limit because there is surely a need to run long-running processes on more than 10 instances in the cluster.
Can you please give me some pointers on how to get around this limit, or how to use ECS optimally to run, say, 20 tasks in parallel?
The spread placement I use is 'One task per host' because the process uses all cores in one host.
How can I architect this better with ECS?
Number of tasks launched (count) per run-task
This is the maximum number of tasks that can be launched per invocation of the run-task API. To launch more tasks, call the run-task API again.
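In other words, the 10-task limit is per call, not per cluster; you can simply call run-task in batches. A rough boto3 sketch (the cluster and task definition names are placeholders, and handling of any "failures" entries is omitted):

    import boto3

    ecs = boto3.client("ecs")

    def launch_tasks(total, cluster="hpc-cluster", task_def="hpc-run"):  # placeholder names
        launched = []
        remaining = total
        while remaining > 0:
            # run-task accepts at most 10 tasks per invocation, so request in batches.
            count = min(10, remaining)
            resp = ecs.run_task(cluster=cluster, taskDefinition=task_def, count=count)
            launched.extend(resp["tasks"])   # resp may also carry a "failures" list
            remaining -= count
        return launched

    launch_tasks(20)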
If your tasks that do the split work are architected to wait until such work is available somehow (with a queue system of some kind or whatever), I would launch them as a service and simply change the 'Desired Tasks' number from zero to 20 as needed.
When you need the workers, scale the service up to 20 Desired Tasks. Then launch your task to split the work and launch the task that waits for the work to be done. When the workers are all done, you can scale them back down to zero.
This also seems like work better suited for Fargate unless you have extreme memory or disk size needs. Otherwise you'll likely want to pair this with scaling up the EC2-based Cluster as needed and back down when not.

Running steps of EMR in parallel

I am running a Spark job on an EMR cluster. The issue I am facing is that all the EMR jobs triggered are executed as steps (in a queue).
Is there any way to make them run in parallel?
If not, is there any alternative for that?
Elastic MapReduce comes by default with a very "step"-oriented YARN setup: a single CapacityScheduler queue with 100% of the cluster resources assigned to it. Because of this configuration, any time you submit a job to an EMR cluster, YARN maximizes the cluster usage for that single job, granting all available resources to it until it finishes.
Running multiple concurrent jobs in an EMR cluster (or any other YARN-based Hadoop cluster, in fact) requires a proper YARN setup with multiple queues to properly grant resources to each job. YARN's documentation covers all of the CapacityScheduler features quite well, and it is simpler than it sounds.
YARN's FairScheduler is quite popular but it uses a different approach and may be a bit more difficult to configure depending on your needs. Given the simplest scenario where you have a single Fair queue, YARN will try to grant containers to waiting jobs as soon as they are freed by running jobs, ensuring that all the jobs submitted to a cluster get at least a fraction of compute resources as soon as they are available.
If you are concerned about YARN jobs waiting in a queue (submitted by Spark):
There are multiple ways to run jobs in parallel.
By default, EMR uses the YARN CapacityScheduler with the DefaultResourceCalculator and has one single DEFAULT queue to which all YARN jobs are submitted. Since there is only one queue, the number of YARN jobs that you can RUN (not just submit) in parallel really depends on the number of AMs, mappers, and reducers that your EMR cluster can run at the same time.
For example: you have a cluster that can run at most 10 mappers in parallel (see AWS EMR Parallel Mappers?).
Suppose you submit 2 map-only jobs, each requiring 10 mappers, one after another. The first job takes up all of the mapper container capacity and runs, while the second waits in the queue for containers to free up. This behavior is similar for AMs and reducers as well.
Now, to make jobs run in parallel in spite of that limit on the number of containers the cluster supports:
Keeping the CapacityScheduler, you can create multiple queues, configuring a capacity percentage and a maximum capacity for each queue, so that a job in the first queue cannot use up all containers even if it needs them. You can then submit your second job to the second queue, which has its own pre-determined capacity.
Alternatively, you can use the FairScheduler by configuring yarn-site.xml. The FairScheduler lets you configure queues and share resources across those queues fairly. You might also use the FairScheduler's preemption option.
Note that the choice of what option to go with - really depends on your use-case and business needs. It is important to learn about all options and possible impact.
https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781491901687/ch04.html
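As a very rough illustration of the CapacityScheduler route on EMR, the queues can be set up through the cluster's capacity-scheduler configuration classification when the cluster is created. The untested boto3 sketch below uses placeholder queue names, capacity percentages, instance types and release label:

    import boto3

    emr = boto3.client("emr")

    emr.run_job_flow(
        Name="parallel-jobs",                        # placeholder
        ReleaseLabel="emr-5.30.0",                   # placeholder release
        Applications=[{"Name": "Spark"}],
        Instances={"MasterInstanceType": "m5.xlarge",
                   "SlaveInstanceType": "m5.xlarge",
                   "InstanceCount": 3,
                   "KeepJobFlowAliveWhenNoSteps": True},
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        Configurations=[{
            "Classification": "capacity-scheduler",
            "Properties": {
                # Split the cluster into two queues so neither job can starve the other.
                "yarn.scheduler.capacity.root.queues": "default,parallel",
                "yarn.scheduler.capacity.root.default.capacity": "60",
                "yarn.scheduler.capacity.root.parallel.capacity": "40",
            },
        }],
    )

Jobs can then be targeted at a particular queue, e.g. with spark-submit --queue parallel.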
Amazon EMR now supports the ability to run multiple steps in parallel. The number of steps allowed to run at once is configurable and can be set when a cluster is launched and at any time after the cluster has started.
Please see this announcement for more details: https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-emr-now-allows-you-to-run-multiple-steps-in-parallel-cancel-running-steps-and-integrate-with-aws-step-functions/.
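With this feature the step concurrency is just a cluster property; a hedged boto3 sketch (the cluster ID is a placeholder):

    import boto3

    emr = boto3.client("emr")

    # Allow up to 5 steps to run concurrently on an existing cluster
    # (valid values are 1-256; it can also be set at launch via run_job_flow).
    emr.modify_cluster(ClusterId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
                       StepConcurrencyLevel=5)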
Just adding updated information. EMR supports parallel steps:
https://aws.amazon.com/about-aws/whats-new/2019/11/amazon-emr-now-allows-you-to-run-multiple-steps-in-parallel-cancel-running-steps-and-integrate-with-aws-step-functions/