Updating ECS service with Terraform fails to place a new task - amazon-web-services

After pushing a new image of my container I use Terraform apply to update the task definition. This seems to work fine but in the ECS service list of tasks I can see the task as inactive and I have an event:
service blahblah was unable to place a task because no container instance met all of its requirements. The closest matching container-instance [guid here] is already using a port required by your task.
The thing is, the site is still active and working.

This is more of an ECS issue than a Terraform issue because Terraform is updating your task definition and updating the service to use the new task definition but ECS is unable to schedule new tasks on to the container instances because you're (presumably) defining a specific port that the container must run on and directly mapping it to the host or using host networking instead of bridge (or the new aws-vpc CNI plugin).
ECS has a couple of parameters to control the behaviour of an update to the service: minimum healthy percent and maximum healthy percent. By default these are set to 100% and 200% respectively meaning that ECS will attempt to deploy a new task matching the new task definition and wait for it to be considered healthy (such as passing ELB health checks) before terminating the old tasks.
In your case you have as many tasks as you have container instances in your cluster and so when it attempts to schedule a new task on to the cluster it is unable to place it because the port is already bound to by the old task. You could also find yourself in this position if you had placement constraints on your task/service.
Because the minimum healthy percent is set to 100% it is unable to schedule the removal of any of the old tasks that would then free up a placement option for a new task.
You could have more container instances in the cluster than you have instances of the task running which would allow ECS to deploy new tasks before removing old tasks from the other instances or you could change the minimum healthy percent (deployment_minimum_healthy_percent in Terraform's ECS service resource) to a number less than 100 that allows deployments to happen.
For example, if you normally deploy 3 instances of the task in the service then setting the minimum healthy percent to 50% would allow ECS to remove one task from the service before scheduling a new task matching the new task definition. It would then proceed with a rolling upgrade, making sure the new task is healthy before replacing the old task.
Setting the minimum healthy percent to 0% would mean that ECS can stop all of the tasks running before starting new tasks but this would obviously lead to a potential (but not guaranteed) service interruption.
Alternatively you could remove the placement constraint by switching away from host networking if that is viable for your service.

Related

Replace ECS tasks in cluster using AWS cli

I'm trying to replace the current tasks in an ECS cluster.
Context:
I have 2 tasks (and a maximum of 4)
Every time I make a change to the docker image, the image is built, tagged, and pushed to ECR (through Jenkins). I wanted to add a timer and after x minutes, replace the current tasks with new ones (also in the CI/CD)
I tried
aws ecs update-service --cluster myCluster --service myService --task-definition myTaskDef
but it didn't work.
Also, several suggestions that I found in StackOverflow and forums, but in the best cases, I ended with 4 tasks, while, I just want to replace the current ones with new ones.
Is this possible using the CLI?
First thing as mentioned by #Marcin, in such deployed where --force-new-deployment is not specified and no change in the task definition revision the deployment will ignore by ECS agent.
The second thing that you are seeing replica after deployment is minimumHealthyPercent and maximumPercent as the service scheduler uses these parameters to determine the deployment strategy.
minimumHealthyPercent
If minimumHealthyPercent is below 100%, the scheduler can ignore
desiredCount temporarily during a deployment. For example, if
desiredCount is four tasks, a minimum of 50% allows the scheduler to
stop two existing tasks before starting two new tasks. Tasks for
services that do not use a load balancer are considered healthy if
they are in the RUNNING state. Tasks for services that use a load
balancer are considered healthy if they are in the RUNNING state and
the container instance they are hosted on is reported as healthy by
the load balancer.
maximumPercent
The maximumPercent parameter represents an upper limit on the number of running tasks during a deployment, which enables you to define the deployment batch size. For example, if desiredCount is four tasks, a maximum of 200% starts four new tasks before stopping the four older tasks (provided that the cluster resources required to do this are available).
Modifies the parameters of a service
So setting minimumHealthyPercent is to 50% the scheduled will stop one exiting task before starting one new task. setting it will 0 then you may see the bad gateway from LB as it will stop both exiting tasks before starting two one.
If you still not able to control the flow then pass the --desired-count
aws ecs update-service --cluster test --service test --task-definition test --force-new-deployment --desired-count 2
Usually you would use --force-new-deployment parameter of update-service:
Whether to force a new deployment of the service. Deployments are not forced by default. You can use this option to trigger a new deployment with no service definition changes. For example, you can update a service's tasks to use a newer Docker image with the same image/tag combination (my_image:latest ) or to roll Fargate tasks onto a newer platform version.

ECS unable to place task despite increasing instance count

I'm facing the following problem when creating new instances and increasing the container desired count at the same time. Since the instances are not running when I increase the desired count, I get a "service XXX was unable to place a task because no container instance met all of its requirements.". A few seconds later the new instances are up, however, the cluster still has "Desire count: 30, Pending count: 0, Running count: 3". In other words, the cluster does not "know" that there are new instances and no new containers are created.
How can I avoid this situation? Is there a parameter that instructs the cluster to monitor the instance count other than immediately after an increase in desired count?
In this case its an expected behavior of ECS, reason being that ECS service scheduler includes circuit breaker logic that throttles how often tasks are placed if they repeatedly fail to launch.
When a new container instance in spined up it takes some time to get
register to the Cluster and it looks like service is getting throttled
because time taken from increase in desired count to registration of
container instances to the cluster.
Having said that, if you wait for ~15 minutes after scaling number of instance in the cluster, Service scheduler will start placing the task on new container instances.
To avoid this situation, ECS Cluster should be autoscaled based on Custer reservation metric, by this ECS cluster will have additional capacity beforehand to accommodate new task count.
and here is a tutorial on scaling ECS cluster.

Deploying new docker image with AWS ECS

I have an ECS cluster with a service in it that is running a task I have defined. It's just a simple flask server as I'm learning how to use ECS. Now I'm trying to understand how to update my app and have it seamlessly deploy.
I start with the flask server returning Hello, World! (rev=1).
I modify my app.py locally to say Hello, World! (rev=2)
I rebuild the docker image, and push to ECR
Since my image is still named image_name:latest, I can simply update the service and force a new deployment with: aws ecs update-service --force-new-deployment --cluster hello-cluster --service hello-service
My minimum percent is set to 100 and my maximum is set to 200% (using rolling updates), so I'm assuming that a new EC2 instance should be set up while the old one is being shutdown. What I observe (continually refreshing the ELB HTTP endpoint) is that that the rev=? in the message alternates back and forth: (rev=1) then (rev=2) without fail (round robin, not randomly).
Then after a little bit (maybe 30 secs?) the flipping stops and the new message appears: Hello, World! (rev=2)
Throughout this process I've noticed that no more EC2 instances have been started. So all this must have been happening on the same instance.
What is going on here? Is this the correct way to update an application in ECS?
This is the normal behavior and it's linked to how you configured your minimum and maximum healthy percent.
A minimum healthy percent of 100% means that at every moment there must be at least 1 task running (for a service that should run 1 instance of your task). A maximum healthy percent of 200% means that you don't allow more than 2 tasks running at the same time (again for a service that should run 1 instance of your task). This means that during a service update ECS will first launch a new task (reaching the maximum of 200% and avoiding to go below 100%) and when this new task is considered healthy it will remove the old one (back to 100%). This explains why both tasks are running at the same time for a short period of time (and are load balanced).
This kind of configuration ensures maximum availability. If you want to avoid this, and can allow a small downtime, you can configure your minimum to 0% and maximum to 100%.
About your EC2 instances: they represent your "cluster" = the hardware that your service use to launch tasks. The process described above happens on this "fixed" hardware.

Replace ECS container instances in terraform setup

We have a terraform deployment that creates an auto-scaling group for EC2 instances that we use as docker hosts in an ECS cluster. On the cluster there are tasks running. Replacing the tasks (e.g. with a newer version) works fine (by creating a new task definition revision and updating the service -- AWS will perform a rolling update). However, how can I easily replace the EC2 host instances with newer ones without any downtime?
I'd like to do this to e.g. have a change to the ASG launch configuration take effect, for example switching to a different EC2 instance type.
I've tried a few things, here's what I think gets closest to what I want:
Drain one instance. The tasks will be distributed to the remaining instances.
Once no tasks are running in that instance anymore, terminate it.
Wait for the ASG to spin up a new instance.
Repeat steps 1 to 3 until all instances are new.
This works almost. The problem is that:
It's manual and therefore error prone.
After this process one of the instances (the last one that was spun up) is running 0 (zero) tasks.
Is there a better, automated way of doing this? Also, is there a way to re-distribute the tasks in an ECS cluster (without creating a new task revision)?
Prior to making changes make sure you have the ASG spanned across multiple availability zones and so are the containers. This ensures High Availability when instances are down in one Zone.
You can configure an update policy of Autoscaling group with AutoScalingRollingUpgrade where you can set MinInstanceInService and MinSuccessfulInstancesPercent to a higher value to maintain slow and safe rolling upgrade.
You may go through this documentation to find further tweaks. To automate this process, you can use terraform to update the ASG launch configuration, this will update the ASG with a new version of launch configuration and trigger a rolling upgrade.

Updating an AWS ECS Service

I have a service running on AWS EC2 Container Service (ECS). My setup is a relatively simple one. It operates with a single task definition and the following details:
Desired capacity set at 2
Minimum healthy set at 50%
Maximum available set at 200%
Tasks run with 80% CPU and memory reservations
Initially, I am able to get the necessary EC2 instances registered to the cluster that holds the service without a problem. The associated task then starts running on the two instances. As expected – given the CPU and memory reservations – the tasks take up almost the entirety of the EC2 instances' resources.
Sometimes, I want the task to use a new version of the application it is running. In order to make this happen, I create a revision of the task, de-register the previous revision, and then update the service. Note that I have set the minimum healthy percentage to require 2 * 0.50 = 1 instance running at all times and the maximum healthy percentage to permit up to 2 * 2.00 = 4 instances running.
Accordingly, I expected 1 of the de-registered task instances to be drained and taken offline so that 1 instance of the new revision of the task could be brought online. Then the process would repeat itself, bringing the deployment to a successful state.
Unfortunately, the cluster does nothing. In the events log, it tells me that it cannot place the new tasks, even though the process I have described above would permit it to do so.
How can I get the cluster to perform the behavior that I am expecting? I have only been able to get it to do so when I manually register another EC2 instance to the cluster and then tear it down after the update is complete (which is not desirable).
I have faced the same issue where the tasks used to get stuck and had no space to place them. Below snippet from AWS doc on updating a service helped me to make the below decision.
If your service has a desired number of four tasks and a maximum
percent value of 200%, the scheduler may start four new tasks before
stopping the four older tasks (provided that the cluster resources
required to do this are available). The default value for maximum
percent is 200%.
We should have the cluster resources available / container instances available to have the new tasks get started so they can start and the older one can drain.
These are the things i do
Before doing a service update add like 20% capacity to your cluster. You can use the ASG (Autoscaling group) commandline and from the desired capacity add 20% to your cluster. This way you will have some additional instance during deployment.
Once you have the instance the new tasks will start spinning up quickly and the older one will start draining.
But does this mean i will have extra container instances ?
Yes, during the deployment you will add some instances but as the older tasks drain they will hang around. The way to remove them is
Create a MemoryReservationLow alarm (~70% threshold in your case) for like 25 mins (longer duration to be sure that we have over commissioned). As the reservation will go low once you have those extra server not being used they can be removed.
I have seen this before. If your port mapping is attempting to map a static host port to the container within the task, you need more cluster instances.
Also this could be because there is not enough available memory to meet the memory (soft or hard) limit requested by the container within the task.