How to add a warm-up time for ECS tasks?

I am using ECS tasks to deploy my services. Recently, however, I have noticed that when I update the service, the first task that starts is usually taken down immediately because it fails the load balancer health checks. I think this is because it needs a little more time (a second or two) to start up. How can I configure this?

When updating the ECS Service with a new Task Definition, I'd suggest also setting the healthCheckGracePeriodSeconds parameter. See Service Definition Parameters for more details.
If you're using the CLI, you should be able to do this with aws ecs update-service --cluster <value> --service <value> --health-check-grace-period-seconds <value>.
For example, suppose your health check interval is 30 seconds, 2 failed checks cause the task to be replaced, and you know your service takes about 65 seconds to start up. You could set --health-check-grace-period-seconds 95, which tells ECS to let the task keep running for 95 seconds even if the load balancer considers it unhealthy. I chose 95 because the next health check after startup would be expected at 90 seconds; with a shorter grace period, the load balancer might not see the service as healthy before the earlier failed checks take effect.
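The arithmetic behind the 95-second figure can be sketched in a few lines of Python. This is only an illustration: the function name and the 5-second margin are made up here, and it assumes health checks fire at exact multiples of the interval counted from task start, which real load balancers do not guarantee.

```python
import math

def grace_period_seconds(startup_seconds, check_interval_seconds, margin_seconds=5):
    """Suggest a grace period covering the first health check after startup.

    Assumes checks fire at multiples of the interval from task start
    (a simplification) and adds a small, arbitrary safety margin.
    """
    # First health check that fires at or after the moment startup completes.
    checks_needed = math.ceil(startup_seconds / check_interval_seconds)
    first_check_after_startup = checks_needed * check_interval_seconds
    return first_check_after_startup + margin_seconds

# Values from the answer: 65 s startup, 30 s interval -> check at 90 s, plus 5 s.
print(grace_period_seconds(65, 30))  # 95
```

The result is then passed as the value of --health-check-grace-period-seconds.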

Related

AWS ECS Fargate deployment optimization not working

My situation right now is that I have a CI/CD pipeline set up in GitHub Actions. This workflow deploys my app container to ECS Fargate with the set of configs it needs to work. To manage my infrastructure I use Terraform, which sets up an Application Load Balancer and the service inside my ECS app cluster, among a lot of other things in my stack.
Before I started optimizing, the pipeline took around 15 minutes (this is way too much for hotfixes, which is the main reason I'm doing this). After some changes to the Dockerfile and the Docker build stage I got this down to around 8 minutes: 3 minutes for the GitHub release tag, the Docker build and the push of the image to ECR, and the remaining 5 minutes for the ECS deploy.
The thing is I found this documentation from AWS in Best Practices - Speeding up deployments for ECS and decided to do some changes in this stage too. After reading Load balancer health check parameters, Load balancer connection draining and Task deployment I changed these configs:
(Terraform) In the Application Load Balancer
deregistration_delay from 100 to 70
health_check interval from 30 to 5
health_check healthy_threshold from 5 to 3
health_check timeout to 4
(Terraform) In the ECS Service
health_check_grace_period_seconds from 100 to 20
(task-definition) In the containerDefinitions:
stopTimeout = 10
So I was expecting to go down from 150 to 15 seconds just from the health_check changes, and even further because of the other settings. But when I forced a new deploy to check the results, I got almost exactly the same deploy time, with the same 5 minutes spent in the ECS stage.
So I would like to know what setting or process I am missing to make the changes work. I looked around in my AWS console and the values were changed, so the Terraform apply did work, but the ECS stage is definitely taking the same time.
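The 150-second and 15-second figures above appear to come from multiplying the health check interval by the healthy threshold. A minimal Python sketch of that estimate follows, assuming a target is marked healthy after healthy_threshold consecutive passing checks spaced one interval apart. The function name is hypothetical, and the estimate ignores task provisioning, image pull and deregistration time.

```python
def time_to_healthy_seconds(interval_seconds, healthy_threshold):
    """Rough time for a target to be marked healthy once it starts passing checks."""
    return interval_seconds * healthy_threshold

before = time_to_healthy_seconds(30, 5)  # old settings from the question
after = time_to_healthy_seconds(5, 3)    # new settings from the question

print(before, after, before - after)  # 150 15 135
```

Under these assumptions the health check changes can save at most about 135 seconds, so if the ECS stage still takes 5 minutes, most of that time is probably being spent elsewhere in the deployment.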
I find that basic ECS Fargate deployments are way slower than ECS EC2 deployments. This makes sense, as Fargate has more work to do: it needs to identify a host and so on, whereas EC2 hosts are already there, running, and may have some of the required Docker layers already downloaded.
I generally find Fargate deployments take 2.5-4 minutes (eu-west-1), so you really need to identify where the lag is.
Some things worth checking, which might help point you in the correct direction:
When do health checks start on the new task? If they start at 4mins then the deployment is only taking 1 minute.
The overall deployment time includes time to stop + deregister the old task(s) - how long is that taking?
How long does it take for you to start your application on an empty docker service?

AWS ASG target tracking an ECS took 15 minutes to scale-in after the desired tasks of ECS is 0

I have an ECS cluster on AWS which uses a capacity provider. The ASG associated with the capacity provider is responsible for scaling EC2 instances out and in based on the desired task count of the ECS service. It is worth mentioning that the desired task count is managed by a Lambda function and updated based on some metrics (it calculates the depth of an SQS queue and changes the desired task count of ECS accordingly).
Scaling out happens almost immediately (not counting the provisioning and pending time), but when the desired task count is set to zero in ECS (by the Lambda function), it takes at least 15 minutes for the ASG to turn off the instances. Since we are using a large number of high-performance EC2 instance types, this scale-in time costs us a lot of money. Is there any way to reduce this cooldown time to a few minutes?
P.S.: I have set the default cooldown to 120, but it didn't change anything.

Replace ECS tasks in cluster using AWS cli

I'm trying to replace the current tasks in an ECS cluster.
Context:
I have 2 tasks (and a maximum of 4)
Every time I make a change to the docker image, the image is built, tagged, and pushed to ECR (through Jenkins). I wanted to add a timer and after x minutes, replace the current tasks with new ones (also in the CI/CD)
I tried
aws ecs update-service --cluster myCluster --service myService --task-definition myTaskDef
but it didn't work.
I also tried several suggestions that I found on Stack Overflow and in forums, but in the best cases I ended up with 4 tasks, whereas I just want to replace the current ones with new ones.
Is this possible using the CLI?
First, as mentioned by @Marcin, a deployment where --force-new-deployment is not specified and the task definition revision has not changed will be ignored by ECS.
Second, the extra tasks you are seeing after the deployment come from minimumHealthyPercent and maximumPercent, as the service scheduler uses these parameters to determine the deployment strategy.
minimumHealthyPercent
If minimumHealthyPercent is below 100%, the scheduler can ignore desiredCount temporarily during a deployment. For example, if desiredCount is four tasks, a minimum of 50% allows the scheduler to stop two existing tasks before starting two new tasks. Tasks for services that do not use a load balancer are considered healthy if they are in the RUNNING state. Tasks for services that use a load balancer are considered healthy if they are in the RUNNING state and the container instance they are hosted on is reported as healthy by the load balancer.
maximumPercent
The maximumPercent parameter represents an upper limit on the number of running tasks during a deployment, which enables you to define the deployment batch size. For example, if desiredCount is four tasks, a maximum of 200% starts four new tasks before stopping the four older tasks (provided that the cluster resources required to do this are available).
Modifies the parameters of a service
So with minimumHealthyPercent set to 50%, the scheduler will stop one existing task before starting one new task. If you set it to 0, you may see Bad Gateway errors from the load balancer, as it will stop both existing tasks before starting the two new ones.
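To see how the two percentages translate into task counts, here is a small Python sketch. It assumes, based on my reading of the AWS documentation, that the lower limit is rounded up and the upper limit is rounded down. The function name is made up for illustration.

```python
def deployment_bounds(desired_count, minimum_healthy_percent, maximum_percent):
    """Return (minimum healthy tasks, maximum running tasks) during a deployment.

    Assumption: the lower limit rounds up and the upper limit rounds down.
    """
    # Ceiling division using integers only, to avoid floating point surprises.
    min_healthy = -(-desired_count * minimum_healthy_percent // 100)
    max_running = desired_count * maximum_percent // 100
    return min_healthy, max_running

print(deployment_bounds(4, 50, 200))   # (2, 8): the documentation example above
print(deployment_bounds(2, 100, 200))  # (2, 4): up to 4 tasks during the deployment
print(deployment_bounds(2, 50, 200))   # (1, 4): one old task may be stopped first
print(deployment_bounds(2, 0, 100))    # (0, 2): both old tasks may be stopped first
```

With 2 desired tasks and limits of 100% and 200%, up to 4 tasks may run at once during a deployment, which would match the 4 tasks seen in the question.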
If you are still not able to control the flow, pass --desired-count as well:
aws ecs update-service --cluster test --service test --task-definition test --force-new-deployment --desired-count 2
Usually you would use --force-new-deployment parameter of update-service:
Whether to force a new deployment of the service. Deployments are not forced by default. You can use this option to trigger a new deployment with no service definition changes. For example, you can update a service's tasks to use a newer Docker image with the same image/tag combination (my_image:latest) or to roll Fargate tasks onto a newer platform version.

ECS services taking more than 10 minutes to start

I am using ECS to deploy my services. I have 2 services, but after the ECS instance is started by my ASG, the ecs-agent Docker container comes up immediately while both of my service containers take more than 10 minutes to come up.
I am using a t2.medium instance, and both of these services are very small and don't do any checks at startup.
Let me know if I need to provide any other information. Note that I've checked the events section, and there is no information there until the instance is started.

Deploying new docker image with AWS ECS

I have an ECS cluster with a service in it that is running a task I have defined. It's just a simple flask server as I'm learning how to use ECS. Now I'm trying to understand how to update my app and have it seamlessly deploy.
I start with the flask server returning Hello, World! (rev=1).
I modify my app.py locally to say Hello, World! (rev=2)
I rebuild the docker image, and push to ECR
Since my image is still named image_name:latest, I can simply update the service and force a new deployment with: aws ecs update-service --force-new-deployment --cluster hello-cluster --service hello-service
My minimum percent is set to 100% and my maximum is set to 200% (using rolling updates), so I'm assuming that a new EC2 instance should be set up while the old one is being shut down. What I observe (continually refreshing the ELB HTTP endpoint) is that the rev=? in the message alternates back and forth between (rev=1) and (rev=2) without fail (round robin, not randomly).
Then after a little bit (maybe 30 secs?) the flipping stops and the new message appears: Hello, World! (rev=2)
Throughout this process I've noticed that no more EC2 instances have been started. So all this must have been happening on the same instance.
What is going on here? Is this the correct way to update an application in ECS?
This is the normal behavior and it's linked to how you configured your minimum and maximum healthy percent.
A minimum healthy percent of 100% means that at every moment there must be at least 1 task running (for a service that should run 1 instance of your task). A maximum percent of 200% means that no more than 2 tasks may run at the same time (again for a service that should run 1 instance of your task). This means that during a service update ECS will first launch a new task (reaching the maximum of 200% while never dropping below 100%), and once this new task is considered healthy it will remove the old one (back to 100%). This explains why both tasks run at the same time for a short period (and are both load balanced).
This kind of configuration ensures maximum availability. If you want to avoid this, and can allow a small downtime, you can configure your minimum to 0% and maximum to 100%.
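The two configurations can be compared with a toy simulation. This is a simplified model, not how the ECS scheduler is actually implemented. It treats a new task as healthy the moment it is launched, whereas ECS waits for health checks, and the function name is made up. Each step is a pair (old tasks, new tasks).

```python
def rolling_update_steps(desired_count, minimum_healthy_percent, maximum_percent):
    """Return the (old, new) task counts after each action in a toy rolling update."""
    min_healthy = -(-desired_count * minimum_healthy_percent // 100)  # rounded up
    max_running = desired_count * maximum_percent // 100              # rounded down
    old, new = desired_count, 0
    steps = [(old, new)]
    while old > 0 or new < desired_count:
        # Launch as many new tasks as the upper limit allows.
        can_start = min(desired_count - new, max_running - (old + new))
        if can_start > 0:
            new += can_start
            steps.append((old, new))
        # Stop as many old tasks as the lower limit allows.
        can_stop = min(old, (old + new) - min_healthy)
        if can_stop > 0:
            old -= can_stop
            steps.append((old, new))
        if can_start <= 0 and can_stop <= 0:
            raise ValueError("deployment cannot make progress with these limits")
    return steps

print(rolling_update_steps(1, 100, 200))  # [(1, 0), (1, 1), (0, 1)]
print(rolling_update_steps(1, 0, 100))    # [(1, 0), (0, 0), (0, 1)]
```

In the first run the (1, 1) step is the window in which both revisions receive traffic. In the second run the (0, 0) step is the short downtime mentioned above.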
About your EC2 instances: they represent your "cluster", that is, the hardware that your service uses to launch tasks. The process described above happens on this "fixed" hardware.