ECS unable to place task despite increasing instance count - amazon-web-services

I'm facing the following problem when creating new instances and increasing the container desired count at the same time. Since the instances are not running yet when I increase the desired count, I get a "service XXX was unable to place a task because no container instance met all of its requirements." event. A few seconds later the new instances are up; however, the cluster still shows "Desired count: 30, Pending count: 0, Running count: 3". In other words, the cluster does not "know" that there are new instances, and no new containers are created.
How can I avoid this situation? Is there a parameter that makes the service keep checking for new container instances, rather than checking only at the moment the desired count is increased?

In this case this is expected behavior of ECS. The reason is that the ECS service scheduler includes circuit-breaker logic that throttles how often tasks are placed if they repeatedly fail to launch.
When a new container instance is spun up, it takes some time for it to register with the cluster, and it looks like the service is getting throttled because of the time taken between the increase in desired count and the registration of the new container instances with the cluster.
Having said that, if you wait roughly 15 minutes after scaling the number of instances in the cluster, the service scheduler will start placing tasks on the new container instances.
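If you would rather not wait for the scheduler's back-off to expire, a couple of CLI checks can help; this is only a sketch, and the cluster and service names are placeholders. First confirm the new instances have actually registered, then force a new deployment so placement is re-evaluated right away.
# confirm the new instances have registered with the cluster (placeholder names)
aws ecs list-container-instances --cluster my-cluster
# trigger another deployment of the same task definition so placement is retried immediately
aws ecs update-service --cluster my-cluster --service my-service --force-new-deployment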
To avoid this situation, the ECS cluster should be auto scaled based on the cluster reservation metrics (CPUReservation / MemoryReservation); that way the ECS cluster has additional capacity available beforehand to accommodate the new task count.
There is also a tutorial on scaling an ECS cluster.
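A rough sketch of reservation-based scale-out (all names, thresholds, and the policy ARN are illustrative placeholders, not from the question): attach a simple scaling policy to the cluster's Auto Scaling group and trigger it from a CloudWatch alarm on the cluster's CPUReservation metric.
# simple scale-out policy on the ECS cluster's ASG (placeholder names)
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name my-ecs-asg \
  --policy-name scale-out-on-reservation \
  --adjustment-type ChangeInCapacity \
  --scaling-adjustment 1
# alarm on the cluster's CPU reservation; --alarm-actions takes the policy ARN returned above
aws cloudwatch put-metric-alarm \
  --alarm-name ecs-cpu-reservation-high \
  --namespace AWS/ECS \
  --metric-name CPUReservation \
  --dimensions Name=ClusterName,Value=my-cluster \
  --statistic Average \
  --period 60 \
  --evaluation-periods 3 \
  --threshold 75 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions <scaling-policy-arn>
The same pattern with MemoryReservation (or both metrics) applies if your tasks are memory-bound rather than CPU-bound.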

Related

EC2 Auto Scaling Group's Instance refresh goes below Healthy threshold

I have an ASG with desired/min/max of 1/1/5 instances (I want the ASG just for rolling deploys and zone failover). When I start the instance refresh with MinHealthyPercentage=100,InstanceWarmup=180, the process starts with deregistration (the instance goes into draining mode on my ALB almost immediately, instead of waiting the 180 warmup seconds until the new instance is healthy) and the application becomes unavailable for a while.
Note that this is not specific to my one-instance case. If I had two instances, the process would also start by deregistering one of them, which does not fulfill the 100% MinHealthyPercentage constraint either (the app would stay available, though)!
Is there any other configuration option I should tune to get the rolling update to create and warm up the new instance first?
Currently, instance refresh always terminates before launching, and it uses MinHealthyPercentage to determine the batch size and when it can move on to the next batch.
It takes a set of instances out of service, terminates them, and launches a set of instances with the new desired configuration. Then, it waits until the instances pass your health checks and complete warmup before it moves on to replacing other instances.
...
Setting the minimum healthy percentage to 100 percent limits the rate of replacement to one instance at a time. In contrast, setting it to 0 percent causes all instances to be replaced at the same time.
https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-instance-refresh.html
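For reference, a sketch of starting such a refresh (the ASG name is a placeholder); as described above, MinHealthyPercentage here only controls the batch size, not whether termination happens before launch:
# start an instance refresh with the preferences from the question (placeholder ASG name)
aws autoscaling start-instance-refresh \
  --auto-scaling-group-name my-asg \
  --preferences MinHealthyPercentage=100,InstanceWarmup=180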
If you are running a single instance and using a launch template with Auto Scaling, it is hard to do a rolling update of the EC2 instance.
I came from the scenario above and ran into this limitation of the feature.
It is mentioned in the limitations of instance refresh: it scales the existing instance down and then recreates a new one, instead of creating the new instance first.
Instances terminated before launch: When there is only one instance in
the Auto Scaling group, starting an instance refresh can result in an
outage. This is because Amazon EC2 Auto Scaling terminates an instance
and then launches a new instance.
Ref : https://docs.aws.amazon.com/autoscaling/ec2/userguide/asg-instance-refresh.html
As a workaround, I scaled the Auto Scaling group's desired capacity up to 2; this creates a new instance with the latest AMI from the launch template.
Now that you have two instances running, the old version and the latest version, you can set the desired capacity back to 1 in the Auto Scaling group.
Setting the desired capacity back to 1 deletes the older instance and keeps the latest instance with the latest AMI.
Command to update desired capacity to 2
- aws autoscaling update-auto-scaling-group --auto-scaling-group-name $ASG_GROUP --desired-capacity 2
Command to update desired capacity to 1
- aws autoscaling update-auto-scaling-group --auto-scaling-group-name $ASG_GROUP --desired-capacity 1
Instead of using instance refresh, this worked well for me.
This does not seem to be the case anymore. An instance refresh now creates a fresh instance and terminates the old one after health checks are successful. AWS Support mentioned this behavior has not changed since 2020.

ECS Updating daemon service when desired running count is 1

If a service is using the rolling update (ECS) deployment type, the
minimum healthy percent represents a lower limit on the number of
tasks in a service that must remain in the RUNNING state during a
deployment, as a percentage of the desired number of tasks (rounded up
to the nearest integer). The parameter also applies while any
container instances are in the DRAINING state if the service contains
tasks using the EC2 launch type. This parameter enables you to deploy
without using additional cluster capacity. For example, if your
service has a desired number of four tasks and a minimum healthy
percent of 50%, the scheduler may stop two existing tasks to free up
cluster capacity before starting two new tasks. Tasks for services
that do not use a load balancer are considered healthy if they are in
the RUNNING state. Tasks for services that do use a load balancer are
considered healthy if they are in the RUNNING state and they are
reported as healthy by the load balancer. The default value for
minimum healthy percent is 100%.
(https://stackoverflow.com/a/40741816/433570, https://docs.aws.amazon.com/AmazonECS/latest/developerguide/update-service.html)
So if I have a single instance running:
If I have the minimum percentage below 50%, the running task will be killed before the new task is created.
If I have the minimum percentage at or above 50%, the new deployment won't be deployed, because it is stuck at having 1 instance.
I have to temporarily increase the desired instance count to 2 and do a deployment if I want to deploy a task without service downtime. Correct?
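Assuming that is the route taken, a rough sketch of the flow (the group, cluster, service, and task definition names and the revision number are placeholders): scale the cluster out, deploy, then scale back in once the deployment is steady.
# temporarily give the cluster a second container instance (placeholder names/values)
aws autoscaling set-desired-capacity --auto-scaling-group-name my-ecs-asg --desired-capacity 2
# deploy the new task definition revision
aws ecs update-service --cluster my-cluster --service my-service --task-definition my-task:2
# once the new task is running and healthy, shrink the cluster again
aws autoscaling set-desired-capacity --auto-scaling-group-name my-ecs-asg --desired-capacity 1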

How to launch new tasks on ecs instances which come up in autoscaling group

I have an ECS cluster which has two instances, as defined in an Auto Scaling group with 2 as the minimum capacity.
I have defined the ECS service to run two containers per instance when it is created or updated, so it launches two containers per ECS instance in the cluster.
Now, suppose I stop/terminate an instance in that cluster; a new instance will automatically come up since the Auto Scaling group has a minimum capacity of two.
The problem is that when the new instance comes up in the Auto Scaling group, it does not run the two tasks that are defined to be in the service; instead, 4 tasks run on one ECS instance and the other, new ECS instance doesn't have any task running on it.
How can I make sure that whenever a new instance comes up in the Auto Scaling group, it also has those two tasks running?
If you want those two EC2 instances to be dedicated to those 4 tasks, you can modify the task definition's memory limit so that each task requires roughly half of one ECS instance's memory.
Let's say you have a t3.small; then the memory limit in your task definition would be about 1 GB. This way, with one t3.small instance, you will get only 2 tasks running on it. Whenever you add another t3.small instance, it provides the missing required memory and another two tasks will run on that new t3.small instance.
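For illustration only (the image, family name, and exact memory number are placeholders; the usable memory an instance registers is a bit below its nominal size, so check the real figure before picking the limit), the hard memory limit goes into the container definition when registering the task definition:
# register a task definition whose container reserves roughly half of a t3.small (placeholder values)
aws ecs register-task-definition \
  --family my-task \
  --container-definitions '[{"name":"app","image":"my-image:latest","memory":900,"essential":true}]'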
You can also consider running 1 task per ECS instance. To do so, choose the DAEMON service type when creating the service, and give more memory to your task in the task definition. That way every new EC2 instance will have 1 running task for this service at all times.
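A minimal sketch of creating such a daemon service with the CLI (cluster, service, and task definition names are placeholders); note that a DAEMON service takes no desired count, since ECS runs exactly one copy per container instance:
# one task per container instance; no --desired-count with the DAEMON strategy
aws ecs create-service \
  --cluster my-cluster \
  --service-name my-daemon-service \
  --task-definition my-task \
  --scheduling-strategy DAEMON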

Updating ECS service with Terraform fails to place a new task

After pushing a new image of my container I use Terraform apply to update the task definition. This seems to work fine but in the ECS service list of tasks I can see the task as inactive and I have an event:
service blahblah was unable to place a task because no container instance met all of its requirements. The closest matching container-instance [guid here] is already using a port required by your task.
The thing is, the site is still active and working.
This is more of an ECS issue than a Terraform issue: Terraform is updating your task definition and updating the service to use the new task definition, but ECS is unable to schedule new tasks onto the container instances. That's because you're (presumably) defining a specific port that the container must run on and directly mapping it to the host, or using host networking instead of bridge (or the new aws-vpc CNI plugin).
ECS has a couple of parameters to control the behaviour of an update to the service: minimum healthy percent and maximum percent. By default these are set to 100% and 200% respectively, meaning that ECS will attempt to deploy a new task matching the new task definition and wait for it to be considered healthy (such as passing ELB health checks) before terminating the old tasks.
In your case you have as many tasks as you have container instances in your cluster and so when it attempts to schedule a new task on to the cluster it is unable to place it because the port is already bound to by the old task. You could also find yourself in this position if you had placement constraints on your task/service.
Because the minimum healthy percent is set to 100% it is unable to schedule the removal of any of the old tasks that would then free up a placement option for a new task.
You could have more container instances in the cluster than you have instances of the task running, which would allow ECS to deploy new tasks before removing old tasks from the other instances. Alternatively, you could change the minimum healthy percent (deployment_minimum_healthy_percent in Terraform's ECS service resource) to a number less than 100 so that deployments can happen.
For example, if you normally deploy 3 instances of the task in the service then setting the minimum healthy percent to 50% would allow ECS to remove one task from the service before scheduling a new task matching the new task definition. It would then proceed with a rolling upgrade, making sure the new task is healthy before replacing the old task.
Setting the minimum healthy percent to 0% would mean that ECS can stop all of the tasks running before starting new tasks but this would obviously lead to a potential (but not guaranteed) service interruption.
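If lowering the minimum healthy percent is acceptable, the same change can also be made on the service directly with the CLI (a sketch with placeholder names; the Terraform attribute above is the declarative equivalent):
# allow ECS to stop one old task to free its host port before placing the replacement
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --deployment-configuration minimumHealthyPercent=50
Note that changing it out of band like this will drift from the Terraform configuration, so setting deployment_minimum_healthy_percent in the resource is the cleaner option.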
Alternatively you could remove the placement constraint by switching away from host networking if that is viable for your service.

Updating an AWS ECS Service

I have a service running on AWS EC2 Container Service (ECS). My setup is a relatively simple one. It operates with a single task definition and the following details:
Desired capacity set at 2
Minimum healthy set at 50%
Maximum percent set at 200%
Tasks run with 80% CPU and memory reservations
Initially, I am able to get the necessary EC2 instances registered to the cluster that holds the service without a problem. The associated task then starts running on the two instances. As expected – given the CPU and memory reservations – the tasks take up almost the entirety of the EC2 instances' resources.
Sometimes, I want the task to use a new version of the application it is running. In order to make this happen, I create a revision of the task, de-register the previous revision, and then update the service. Note that I have set the minimum healthy percentage to require 2 * 0.50 = 1 instance running at all times and the maximum healthy percentage to permit up to 2 * 2.00 = 4 instances running.
Accordingly, I expected 1 of the de-registered task instances to be drained and taken offline so that 1 instance of the new revision of the task could be brought online. Then the process would repeat itself, bringing the deployment to a successful state.
Unfortunately, the cluster does nothing. In the events log, it tells me that it cannot place the new tasks, even though the process I have described above would permit it to do so.
How can I get the cluster to perform the behavior that I am expecting? I have only been able to get it to do so when I manually register another EC2 instance to the cluster and then tear it down after the update is complete (which is not desirable).
I have faced the same issue, where the tasks would get stuck with no space to place them. The snippet below from the AWS documentation on updating a service helped me make the decision described below.
If your service has a desired number of four tasks and a maximum
percent value of 200%, the scheduler may start four new tasks before
stopping the four older tasks (provided that the cluster resources
required to do this are available). The default value for maximum
percent is 200%.
You need to have cluster resources / container instances available so that the new tasks can start and the older ones can then drain.
These are the things I do:
Before doing a service update, add around 20% capacity to your cluster. You can use the ASG (Auto Scaling group) command line to raise the desired capacity by about 20%. This way you will have some additional instances during the deployment.
Once you have those instances, the new tasks will start spinning up quickly and the older ones will start draining.
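A sketch of that pre-deployment bump with the CLI (the group name and numbers are placeholders; going from 10 to 12 is the ~20% example):
# check the current desired capacity of the cluster's ASG (placeholder name)
aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names my-ecs-asg \
  --query 'AutoScalingGroups[0].DesiredCapacity'
# raise it by roughly 20% before kicking off the service update, e.g. from 10 to 12
aws autoscaling set-desired-capacity --auto-scaling-group-name my-ecs-asg --desired-capacity 12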
But does this mean I will have extra container instances?
Yes, during the deployment you will add some instances, and after the older tasks drain those extra instances will hang around. The way to remove them is:
Create a low MemoryReservation alarm (~70% threshold in your case) over a longer window, say 25 minutes, to be sure the cluster really is over-provisioned. Once the reservation drops because those extra servers are not being used, they can be removed.
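A sketch of that scale-in alarm (names, thresholds, and the scale-in policy ARN are placeholders); 5 evaluation periods of 5 minutes give the ~25-minute window mentioned above:
# fire when the cluster's memory reservation stays under 70% for 5 x 5 minutes
aws cloudwatch put-metric-alarm \
  --alarm-name ecs-memory-reservation-low \
  --namespace AWS/ECS \
  --metric-name MemoryReservation \
  --dimensions Name=ClusterName,Value=my-cluster \
  --statistic Average \
  --period 300 \
  --evaluation-periods 5 \
  --threshold 70 \
  --comparison-operator LessThanThreshold \
  --alarm-actions <scale-in-policy-arn>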
I have seen this before. If your port mapping is attempting to map a static host port to the container within the task, you need more cluster instances.
Also this could be because there is not enough available memory to meet the memory (soft or hard) limit requested by the container within the task.
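To tell which of the two it is, one option is to inspect what each container instance still has free (the cluster name and instance ARN are placeholders; the ARNs come from aws ecs list-container-instances):
# show the ports, CPU, and memory still available on a container instance
aws ecs describe-container-instances \
  --cluster my-cluster \
  --container-instances <container-instance-arn> \
  --query 'containerInstances[].remainingResources'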