I have an ECS service running some tasks. The server running inside a task may take 1–10 minutes to complete one request.
I am using SQS for task queuing. When the number of queued messages exceeds a certain threshold, the service scales out the ECS tasks, and it scales in when the queue depth drops below a certain number.
However, as there is no lifecycle hook feature for ECS tasks, during scale-in the ECS tasks are shut down while processing is still running, and it is not possible to delay task termination because of that missing hook.
According to our specification, we can't use the timeout feature, as we don't know beforehand how much time a job will take to finish.
Please suggest how to solve the problem.
There is no general solution to this problem, especially if you don't want to use a timeout. In fact, there is a long-standing, still-open GitHub issue dedicated to this:
[ECS] [request]: Control which containers are terminated on scale in
You could control this to some extent by running your service on EC2 (using EC2 scale-in protection) rather than Fargate. So either you have to re-architect your solution, or scale your service out and in manually.
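If you go the EC2 route, one pattern is for the worker to toggle EC2 Auto Scaling scale-in protection on its own instance while a job is in flight. A minimal sketch with boto3, assuming the worker knows its instance ID and Auto Scaling group name (both hypothetical placeholders here):

```python
import boto3

autoscaling = boto3.client("autoscaling")

def set_scale_in_protection(instance_id: str, asg_name: str, protect: bool) -> None:
    """Enable or disable EC2 Auto Scaling scale-in protection for one instance."""
    autoscaling.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ProtectedFromScaleIn=protect,
    )

# Hypothetical usage: protect the host before starting a long job, release it afterwards.
set_scale_in_protection("i-0123456789abcdef0", "my-worker-asg", True)
# ... run the long job ...
set_scale_in_protection("i-0123456789abcdef0", "my-worker-asg", False)
```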
Related
I've been experiencing this with my ECS service for a few months now. Previously, when we would update the service with a new task definition, it would perform the rolling update correctly, deregistering the old tasks from the target group and draining all HTTP connections to them before eventually stopping them. However, lately ECS is going straight to stopping the old tasks before draining connections or removing them from the target group. This is resulting in 8-12 seconds of API downtime for us while new HTTP requests continue to be routed to the now-stopped tasks that are still in the target group. This happens whether we trigger the service update via the CLI or the console - same behaviour. Shown here are a screenshot of a sample sequence of events from ECS demonstrating the issue, as well as the corresponding ECS agent logs for the same instance.
Of particular note when reviewing these ECS agent logs against the sequence of events is that the logs do not have an entry at 21:04:50 when the task was stopped. This feels like a clue to me, but I'm not sure where to go from here with it. Has anyone experienced something like this, or have any insights as to why the tasks wouldn't drain and be removed from the target group before being stopped?
For reference, the service is behind an AWS application load balancer. Happy to provide additional details if someone thinks of what else may be relevant
It turns out that ECS changed the timing of when the events would be logged in the UI in the screenshot. In fact, the targets were actually being drained before being stopped. The "stopped n running task(s)" message is now logged at the beginning of the task shutdown lifecycle steps (before deregistration) instead of at the end (after deregistration) like it used to.
That said, we were still getting brief downtime spikes on our service at the load balancer level during deployments, but ultimately this turned out to be because the high startup overhead of the new task versions spinning up briefly pegged the CPU of the instances in the cluster at 100% when there was also sufficient traffic during the deployment, causing some requests to get dropped.
A good-enough-for-now solution was to raise our minimum healthy deployment percentage to 100% and set the maximum deployment percentage to 150% (as opposed to the old 200% setting), which forces the deployments to "slow down", launching only 50% of the intended new tasks at a time and waiting until they are stable before launching the rest. This spreads the high task startup overhead over two smaller CPU spikes rather than one large one and has so far successfully prevented any more downtime during deployments. We'll also be looking into reducing the startup overhead itself. Figured I'd update this in case it helps anyone else out there.
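For reference, here is a minimal sketch of making that deployment configuration change with boto3 (the cluster and service names are hypothetical placeholders):

```python
import boto3

ecs = boto3.client("ecs")

# Force slower rolling deployments: keep all old tasks healthy (100%) and cap
# capacity at 150% of the desired count, so new tasks start in smaller batches.
ecs.update_service(
    cluster="my-cluster",        # hypothetical cluster name
    service="my-api-service",    # hypothetical service name
    deploymentConfiguration={
        "minimumHealthyPercent": 100,
        "maximumPercent": 150,
    },
)
```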
I am running a Java process inside ECS Fargate containers and have set up auto scaling to scale out when memory utilization is above 60% and scale in accordingly. This setup is working fine, but I am not able to figure out the criteria ECS uses to decide which tasks to shut down as part of a scale-in event, i.e. how does it distinguish between the different tasks and pick one to shut down?
Does it check whether there are any active requests on the tasks, and if there are multiple such tasks, does it then pick randomly?
There is a years-long open issue about this on GitHub:
Control which containers are terminated on scale in
From the issue and its comments you can infer the following:
Does it check whether there are any active requests on the tasks?
No.
If there are multiple such tasks, does it then pick randomly?
It's random.
There is actually an update on this: you can now mark your running tasks as protected. Check this for more details:
https://aws.amazon.com/premiumsupport/knowledge-center/ecs-fargate-service-auto-scaling/
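Concretely, one way to set the protection from outside the task is the UpdateTaskProtection API. A minimal sketch with boto3 (the cluster name and task ARN are hypothetical placeholders):

```python
import boto3

ecs = boto3.client("ecs")

# Protect a specific task from scale-in for up to 60 minutes.
ecs.update_task_protection(
    cluster="my-cluster",                                        # hypothetical
    tasks=["arn:aws:ecs:us-east-1:123456789012:task/abc123"],    # hypothetical task ARN
    protectionEnabled=True,
    expiresInMinutes=60,
)
```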
I want to run Druid on EKS but was concerned about using EC2 autoscaling groups to scale my middle managers. If every middle manager is running an ingestion task but AWS decides to scale down, will a middle manager be terminated or will there be termination protection in place? If so, what other alternatives to scaling do people suggest?
A signal will be sent to your containers to give them an opportunity to shut down gracefully. This is part of lifecycle management.
By default, the orchestrator will wait 30 seconds before forcefully stopping the container. You can adjust this by setting terminationGracePeriodSeconds. You can also add hooks like PostStart or PreStop to do any extra operations to ensure consistency in your system.
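On EKS specifically, both of these are set on the pod spec. A minimal sketch expressed as a Python dict manifest applied with the official kubernetes client (the image, drain script, and names are hypothetical placeholders):

```python
from kubernetes import client, config

# Pod spec with a long termination grace period and a preStop hook. The preStop
# command runs before SIGTERM is delivered; terminationGracePeriodSeconds caps
# the total shutdown time before the container is force-killed.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "middle-manager"},                          # hypothetical name
    "spec": {
        "terminationGracePeriodSeconds": 1800,                       # allow up to 30 minutes to drain
        "containers": [{
            "name": "middle-manager",
            "image": "my-registry/druid-middlemanager:latest",       # hypothetical image
            "lifecycle": {
                "preStop": {
                    # Hypothetical drain script that blocks until current ingestion tasks finish.
                    "exec": {"command": ["/bin/sh", "-c", "/opt/druid/drain-and-wait.sh"]}
                }
            },
        }],
    },
}

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod_manifest)
```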
See also: EC2 Autoscaling lifecycle hooks
Is there a way to ensure an AWS ECS container instance doesn't shut down in the middle of running a critical task?
I have an auto-scaling AWS ECS service that scales the number of instances based on CPU usage. These instances process long-running batch jobs that may take anywhere from 5 to 30 minutes.
The problem is that sometimes, during a scale-down, an instance that's actively running a critical job gets shut down, which ultimately causes the job to fail.
You can use a feature called managed termination protection.
When the scaling policy reduces the number of instances, it has no control over which instances actually terminate. The default behavior of the Auto Scaling group may well terminate instances that are running tasks, even though there are instances not running any. This is where managed termination protection comes into the picture. With this option enabled, ECS dynamically manages instance termination protection on your behalf.
Please have a look at Controlling which Auto Scaling instances terminate during scale in and specifically the section Instance scale-in protection in the AWS documentation.
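A minimal sketch of enabling managed termination protection when creating a capacity provider with boto3 (the provider name and Auto Scaling group ARN are hypothetical placeholders; the Auto Scaling group itself also needs instance scale-in protection and managed scaling enabled):

```python
import boto3

ecs = boto3.client("ecs")

# Capacity provider whose instances ECS protects from scale-in while they run tasks.
ecs.create_capacity_provider(
    name="batch-capacity-provider",  # hypothetical name
    autoScalingGroupProvider={
        # Hypothetical Auto Scaling group ARN.
        "autoScalingGroupArn": "arn:aws:autoscaling:us-east-1:123456789012:"
                               "autoScalingGroup:uuid:autoScalingGroupName/my-batch-asg",
        "managedScaling": {"status": "ENABLED", "targetCapacity": 100},
        "managedTerminationProtection": "ENABLED",
    },
)
```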
Problem: Fargate tasks are being shut down without completing the processes within the task upon scaling in. (Auto Scaling implemented)
Is there a possibility for the Fargate task to exit gracefully (to complete all the processes within the task before shutting it down)?
There is a way to handle this in EC2 through lifecycle hooks, but I'm not sure if there is anything similar in an Amazon Fargate cluster.
Capture the SIGTERM signal and do your cleanup there. You can trap it in your application, using whatever programming language you want, or trap it in a shell entrypoint script.
For more information see this blogpost from AWS.
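As a minimal sketch, trapping SIGTERM in a Python worker might look like the following (the job loop and cleanup are placeholders):

```python
import signal
import sys
import time

shutting_down = False

def handle_sigterm(signum, frame):
    """Mark the worker for shutdown; finish the current job before exiting."""
    global shutting_down
    shutting_down = True

signal.signal(signal.SIGTERM, handle_sigterm)

while not shutting_down:
    # Placeholder for pulling and processing one job (e.g. one SQS message).
    time.sleep(1)

# Placeholder cleanup: flush results, delete the SQS message, close connections.
sys.exit(0)
```

Keep in mind the container only gets the task's stopTimeout (30 seconds by default, and at most 120 seconds on Fargate) between SIGTERM and SIGKILL, so jobs that run for minutes still need scale-in protection rather than relying on the trap alone.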
In 2022 AWS introduced the Task scale-in protection endpoint.
The following task scale-in protection endpoint path is available to containers: $ECS_AGENT_URI/task-protection/v1/state
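A minimal sketch of a task protecting itself before starting a long job and releasing the protection afterwards, assuming the requests library and a 60-minute expiry chosen as an example:

```python
import os
import requests

# The ECS agent injects this URI into the container's environment.
ECS_AGENT_URI = os.environ["ECS_AGENT_URI"]

def set_task_protection(enabled: bool, expires_in_minutes: int = 60) -> None:
    """Toggle scale-in protection for the current task via the ECS agent endpoint."""
    body = {"ProtectionEnabled": enabled}
    if enabled:
        body["ExpiresInMinutes"] = expires_in_minutes
    response = requests.put(f"{ECS_AGENT_URI}/task-protection/v1/state", json=body)
    response.raise_for_status()

# Protect the task while the long-running job executes, then release it.
set_task_protection(True, expires_in_minutes=60)
# ... process the job ...
set_task_protection(False)
```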