I have implemented the Job Observer Pattern using SQS and ECS. Job descriptions are pushed to the SQS queue for processing. The job processing runs on an ECS cluster backed by an Auto Scaling Group, with the work executed as ECS Docker tasks.
Each ECS task does the following (a minimal sketch of the loop is shown after the list):
Read message from SQS queue
Execute job on data (~1 hour)
Delete message
Loop while there are more messages
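Roughly, the loop looks like this (a minimal boto3 sketch, not my exact code; the queue URL environment variable and the job function are placeholders, and the visibility timeout is assumed to be set long enough to cover the ~1 hour job):

```python
import os
import boto3

def run_job(job_description: str) -> None:
    ...  # placeholder: the actual ~1 hour job logic

sqs = boto3.client("sqs")
queue_url = os.environ["JOB_QUEUE_URL"]  # assumed to be injected into the task

while True:
    # Long-poll for a single job description.
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,
        # Must comfortably exceed the job duration (~1 hour), otherwise the
        # message becomes visible again and is picked up by another worker.
        VisibilityTimeout=2 * 60 * 60,
    )
    messages = resp.get("Messages", [])
    if not messages:
        break  # queue drained: fall through to whatever scale-in logic follows

    message = messages[0]
    run_job(message["Body"])       # execute job on data (~1 hour)
    sqs.delete_message(            # delete the message only after success
        QueueUrl=queue_url,
        ReceiptHandle=message["ReceiptHandle"],
    )
```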
I would like to scale the cluster in as instances run out of work, eventually down to zero instances.
Looking at this similar post, the answers suggest scale-in would need to be handled outside the ASG: instances would scale themselves in, either by explicitly self-terminating or by toggling off ASG instance scale-in protection when there are no more messages.
This also doesn't handle the case of running multiple ECS tasks on a single instance, since an individual task shouldn't bring the instance down while other tasks are still running in parallel.
Am I limited to self scale-in and one task per instance? Is there a way to terminate an instance only after all ECS tasks on it have exited? Are there other scale-in alternatives?
You could use CloudWatch Alarms with Alarm Actions to detect and terminate worker instances that have been idle for a certain period of time.
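For example, a per-instance alarm on low CPU with the built-in EC2 terminate action could look roughly like this (a boto3 sketch; the instance ID, region, and thresholds are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Terminate the instance after ~30 minutes of near-zero CPU.
cloudwatch.put_metric_alarm(
    AlarmName="idle-worker-terminate",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Average",
    Period=300,                 # 5-minute datapoints
    EvaluationPeriods=6,        # 6 consecutive periods = 30 minutes idle
    Threshold=2.0,
    ComparisonOperator="LessThanThreshold",
    # Built-in EC2 alarm action that terminates the alarming instance.
    AlarmActions=["arn:aws:automate:us-east-1:ec2:terminate"],
)
```

Note that if the instance belongs to an Auto Scaling Group, terminating it this way will normally cause the ASG to replace it unless the desired capacity is also reduced.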
I ended up using:
A scale-out policy that adds the same number of instances as there are pending SQS queue messages
A scale-in policy that sets the desired capacity to zero once the SQS queue is empty
Enabling ASG instance scale-in protection at the start of the batch job and disabling it at the end (a sketch of this toggle follows below)
This restricts me to one batch job per instance, but it worked well for my scenario.
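A rough sketch of the protection toggle around the job (the ASG name is a placeholder, and IMDSv1 is shown for brevity):

```python
import boto3
import urllib.request

ASG_NAME = "batch-worker-asg"  # placeholder: your Auto Scaling Group name

def current_instance_id() -> str:
    # EC2 instance metadata service (IMDSv1 shown here for brevity).
    with urllib.request.urlopen(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ) as resp:
        return resp.read().decode()

def set_scale_in_protection(protected: bool) -> None:
    autoscaling = boto3.client("autoscaling")
    autoscaling.set_instance_protection(
        AutoScalingGroupName=ASG_NAME,
        InstanceIds=[current_instance_id()],
        ProtectedFromScaleIn=protected,
    )

# At the start of the batch job:
set_scale_in_protection(True)
try:
    pass  # run the job here
finally:
    # At the end (or on failure), let the ASG scale this instance in again.
    set_scale_in_protection(False)
```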
Another solution to this problem is the AWS Batch service, announced at the end of 2016.
Related
I have an ECS cluster running some tasks. The server running inside each task may take 1 to 10 minutes to complete one request.
I am using SQS for task queuing. When the number of queued tasks exceeds a certain amount, the service scales up the number of ECS tasks, and it scales back down when the queue drops below a certain number.
However, as there is no lifecycle hook feature for ECS tasks, during scale-down the ECS tasks are shut down while processing is still running, and it isn't possible to delay the task termination.
According to our specification, we can't use the timeout feature, as we don't know in advance how long a job will take to finish.
Please suggest how to solve the problem.
There is no general solution to this problem, especially if you don't want to use a timeout. In fact, there is a long-standing, still-open GitHub issue dedicated to this:
[ECS] [request]: Control which containers are terminated on scale in
You could gain some control over this by running your services on EC2 (using EC2 scale-in protection) rather than Fargate. So either you have to re-architect your solution, or scale your service out and in manually.
I want to run Druid on EKS but was concerned about using EC2 autoscaling groups to scale my middle managers. If every middle manager is running an ingestion task but AWS decides to scale down, will a middle manager be terminated or will there be termination protection in place? If so, what other alternatives to scaling do people suggest?
A signal will be sent to your containers to give them an opportunity to shut down gracefully. This is part of lifecycle management.
By default, the orchestrator will wait 30 seconds before forcefully stopping the container. You can adjust this by setting terminationGracePeriodSeconds. You can also add postStart or preStop hooks to perform any extra operations needed to keep your system consistent.
See also: EC2 Auto Scaling lifecycle hooks
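As a generic illustration of what "shutting down gracefully on the signal" means for a worker container (Python used purely as a sketch; the job-fetching and processing functions are placeholders, and terminationGracePeriodSeconds is assumed to be raised to cover the longest job):

```python
import signal
import time

shutting_down = False

def handle_sigterm(signum, frame):
    # The orchestrator sends SIGTERM first; mark the worker as draining so it
    # stops picking up new work but finishes the job already in flight.
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

def next_job():
    # Placeholder: fetch the next ingestion/processing task, or None if idle.
    return None

def process(job):
    # Placeholder: the actual long-running work.
    time.sleep(1)

while not shutting_down:
    job = next_job()
    if job is None:
        time.sleep(5)  # idle: poll again shortly
        continue
    process(job)
# Exiting promptly after SIGTERM keeps shutdown inside the grace period.
```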
Is there a way to ensure an AWS ECS container instance doesn't shut down in the middle of running a critical task?
I have an auto-scaling AWS ECS service that scales the number of instances based on CPU usage. These instances process long-running batch jobs that may take anywhere from 5 to 30 minutes.
The problem is that sometimes, during a scale-down, an instance that's actively running a critical job gets shut down, which ultimately causes the job to fail.
You can use a feature called managed termination protection.
When the scaling policy reduces the number of instances, it has no control over which instances actually terminate. The default behavior of the Auto Scaling group may well terminate instances that are running tasks, even though there are instances not running any tasks. This is where managed termination protection comes into the picture. With this option enabled, ECS dynamically manages instance termination protection on your behalf.
Please have a look at Controlling which Auto Scaling instances terminate during scale in and specifically the section Instance scale-in protection in the AWS documentation.
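As a rough sketch, managed termination protection is enabled on the capacity provider that wraps your Auto Scaling Group (the names and ARN below are placeholders, managed scaling must be enabled alongside it, and the ASG itself needs instance scale-in protection enabled for new instances):

```python
import boto3

ecs = boto3.client("ecs")

# Placeholder ARN: the Auto Scaling Group backing the ECS cluster.
asg_arn = (
    "arn:aws:autoscaling:us-east-1:123456789012:autoScalingGroup:"
    "11111111-2222-3333-4444-555555555555:autoScalingGroupName/batch-worker-asg"
)

ecs.create_capacity_provider(
    name="batch-worker-capacity-provider",
    autoScalingGroupProvider={
        "autoScalingGroupArn": asg_arn,
        "managedScaling": {
            "status": "ENABLED",      # let ECS drive the ASG's desired capacity
            "targetCapacity": 100,
        },
        # ECS protects instances that are running non-daemon tasks and only
        # removes instances once their tasks have stopped.
        "managedTerminationProtection": "ENABLED",
    },
)
```

The capacity provider then has to be associated with the cluster and used in the service's capacity provider strategy.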
I currently have a Fargate cluster that contains a service. This service always has 1 task running and polls SQS. The service will scale the number of tasks as SQS grows/shrinks. However, the task has a lot of idle time where there are no messages in the queue. To save on costs, is it possible to make the service go down to 0 tasks?
I have been trying to do this and the service will always try to start at least 1 task.
If this is not possible, would it be best practice to not use a service at all, put a CloudWatch alarm on SQS, create a task directly in the cluster when the queue size is greater than 0, and then shut the task down when SQS is back to 0? Essentially mimicking the functionality of a service.
Yes, you can. You can also use a Target Tracking Policy, which allows you to scale more efficiently than a Step Scaling Policy.
See https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-using-sqs-queue.html for more details (it's about EC2 but works for ECS as well).
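A rough sketch with Application Auto Scaling (cluster, service, and queue names are placeholders): MinCapacity of 0 is what lets the service drop to no tasks, and the target tracking here follows the raw SQS queue depth, a simplification of the backlog-per-worker metric described in the linked guide.

```python
import boto3

aas = boto3.client("application-autoscaling")

resource_id = "service/my-cluster/my-queue-worker"  # placeholder cluster/service

# Allow the service's DesiredCount to go all the way down to zero tasks.
aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=0,
    MaxCapacity=10,
)

# Target tracking on queue depth: keep roughly 5 visible messages per task.
aas.put_scaling_policy(
    PolicyName="sqs-backlog-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,
        "CustomizedMetricSpecification": {
            "Namespace": "AWS/SQS",
            "MetricName": "ApproximateNumberOfMessagesVisible",
            "Dimensions": [{"Name": "QueueName", "Value": "my-job-queue"}],
            "Statistic": "Average",
        },
        "ScaleInCooldown": 120,
        "ScaleOutCooldown": 60,
    },
)
```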
Amazon ECS provides a really good feature for scheduled tasks: ECS Scheduled Tasks, which works pretty well.
However, with this approach it's important to always keep at least one ECS instance in the ECS cluster.
What is the best way to:
Launch (scale out) an ECS instance for the periodic job (just before task execution);
Run the ECS tasks on the newly created instance;
Terminate (scale in) the instance after completion.
One possible workaround is to write a Lambda function that does something like that (launches the EC2 instance), but it looks like too much pain.
I finally found an easy solution to the problem. Everything was quite simple:
Go to Auto Scaling Groups (you can find this in the EC2 dashboard, under the Auto Scaling section);
Create a scheduled action (here you can specify the required frequency for your container instance);
Save your configuration. The instance will be added at the specified time.
In my case I also need to scale this instance back in after a one-hour period.
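The same pair of scheduled actions can also be created programmatically; a sketch with boto3 (the ASG name and cron schedules are placeholders, with the second action scaling back in one hour later):

```python
import boto3

autoscaling = boto3.client("autoscaling")
ASG_NAME = "ecs-scheduled-task-asg"  # placeholder

# Bring one container instance up shortly before the scheduled ECS task runs.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName=ASG_NAME,
    ScheduledActionName="scale-out-before-job",
    Recurrence="50 2 * * *",   # 02:50 UTC daily, just before a 03:00 task
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
)

# Scale back to zero an hour later, once the task has finished.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName=ASG_NAME,
    ScheduledActionName="scale-in-after-job",
    Recurrence="50 3 * * *",
    MinSize=0,
    MaxSize=1,
    DesiredCapacity=0,
)
```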