ECS services taking more than 10 minutes to start

I am using ECS to deploy my services. I have 2 services, but after the ECS instance launched by my ASG starts, the ecs-agent Docker container comes up immediately while both of my service containers take more than 10 minutes to come up.
I am using a t2.medium instance, and both services are very small and don't do any checks at startup.
Let me know if I need to provide any other information. Note that I've checked the Events section and there is no information there until the instance has started.

Related

AWS ECS Fargate deployment optimization not working

My situation right now is that I have a CI/CD pipeline set up in GitHub Actions; this workflow deploys my app container to ECS Fargate with the set of configs it needs to run. To manage my infrastructure I use Terraform to set up an Application Load Balancer and the service inside my ECS app cluster, among a lot of other things in my stack.
Before I started doing some optimization the pipeline took around 15 minutes (way too much for hotfixes, which is the main reason I'm doing this). After some changes to the Dockerfile and the Docker build stage I managed to take this down to around 8 minutes, of which 3 minutes go to the GitHub release tag plus the Docker build and push of the image to ECR, and the remaining 5 minutes go to the ECS deploy.
The thing is, I found the AWS documentation Best Practices - Speeding up deployments for ECS and decided to make some changes in this stage too. After reading Load balancer health check parameters, Load balancer connection draining and Task deployment, I changed these configs (sketched in Terraform after the list):
(Terraform) In the Application Load Balancer
deregistration_delay from 100 to 70
health_check interval from 30 to 5
health_check healthy_threshold from 5 to 3
health_check timeout to 4
(Terraform) In the ECS Service
health_check_grace_period_seconds from 100 to 20
(task-definition) In the containerDefinitions:
stopTimeout = 10
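For reference, here is a minimal Terraform sketch of the settings described above. The resource names, port, health-check path, variables, and network details are hypothetical placeholders, not the poster's actual stack; only the numeric values come from the question.

```hcl
# Hypothetical ALB target group with the tightened health check.
resource "aws_lb_target_group" "app" {
  name                 = "app"
  port                 = 80
  protocol             = "HTTP"
  target_type          = "ip"   # Fargate tasks register by IP
  vpc_id               = var.vpc_id
  deregistration_delay = 70     # down from 100

  health_check {
    path                = "/health"
    interval            = 5     # down from 30
    timeout             = 4
    healthy_threshold   = 3     # down from 5
    unhealthy_threshold = 3
  }
}

# Hypothetical ECS service with the shorter health check grace period.
resource "aws_ecs_service" "app" {
  name                              = "app"
  cluster                           = aws_ecs_cluster.app.id
  task_definition                   = aws_ecs_task_definition.app.arn
  desired_count                     = 1
  launch_type                       = "FARGATE"
  health_check_grace_period_seconds = 20   # down from 100

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "app"
    container_port   = 80
  }

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [var.service_sg_id]
  }
}

# Hypothetical task definition carrying the shorter stop timeout.
resource "aws_ecs_task_definition" "app" {
  family                   = "app"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = var.execution_role_arn

  container_definitions = jsonencode([{
    name         = "app"
    image        = var.image
    essential    = true
    stopTimeout  = 10   # seconds ECS waits after SIGTERM before SIGKILL
    portMappings = [{ containerPort = 80 }]
  }])
}
```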
So I was expecting the health-check wait to go down from 150 seconds (5 healthy checks * a 30-second interval) to 15 seconds (3 * 5 seconds), and even more because of the other settings, but when I forced a new deploy to check the results I got almost exactly the same deploy time, with the same 5 minutes spent in the ECS stage.
So I would like to know what setting or process I am missing to make the changes work. I looked around in my AWS console and the values were changed, so the Terraform apply did work, but the ECS stage is definitely taking the same time.
I find that basic ECS Fargate deployments are noticeably slower than ECS EC2 deployments. That makes sense, as Fargate has more work to do: it needs to find a host, etc., whereas EC2 hosts are already there, running, and may have some of the required Docker layers downloaded.
I generally find Fargate deployments take 2.5-4 minutes (eu-west-1), so you really need to identify where the lag is.
Some things worth checking, which might help point you in the correct direction:
When do health checks start on the new task? If they don't start until the 4-minute mark, the deployment itself is only taking 1 minute.
The overall deployment time includes the time to stop and deregister the old task(s) - how long is that taking?
How long does it take for your application to start on a plain Docker host, outside ECS?

Cannot run more than two tasks in Amazon Web Services

I have two clusters in my Amazon Elastic Container Service, one for production and one as a testing environment.
Each cluster has three different services with one task each. There should be 6 tasks running.
To update a task, I always pushed my new Docker Image to the Elastic Container Registry and restarted the Service with the new Image.
For about 2 weeks now I have only been able to start 2 tasks in total. It doesn't depend on the cluster; just 2 tasks overall.
It looks like the tasks that should start are stuck in the "In Progress" Rollout State.
Has anybody had a similar problem, or does anyone know how to fix this?
I wrote to AWS support about this issue. Their reply:
"After a review, I have noticed that the XXXXXXX region has not yet been activated. In order to activate the region you will have to launch an instance, I recommended a Free Tier EC2 instance. After the EC2 instance has been launched you can terminate it thereafter."
I don't know why, but it's working now.

Running ECS service on 2 container instances (ECS instances)

I have an ECS service that is required to run on exactly 2 container instances. How can this be achieved? I could not find any place in the container definition where I can fix the number of ECS instances.
There are a few ways to achieve this. One is to deploy your ECS service on Fargate. When you do so and set your task count to, say, 2, ECS will deploy your 2 tasks onto 2 separate, dedicated operating systems/VMs managed by AWS. Two or more tasks can never be colocated on one of these VMs; it's always a 1 task : 1 VM relationship.
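As a rough Terraform sketch (cluster, task definition, subnet, and security group names are hypothetical), the Fargate variant is just a desired count on the service:

```hcl
resource "aws_ecs_service" "api_fargate" {
  name            = "api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  launch_type     = "FARGATE"
  desired_count   = 2   # 2 tasks, each on its own AWS-managed VM

  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [var.service_sg_id]
  }
}
```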
If you are using EC2 as your launch type and you want to make sure your service deploys exactly 1 task per instance, the easiest way is to configure your ECS service with the DAEMON scheduling strategy. In this case you don't even need to (in fact, can't) configure the number of tasks in your service, because ECS will always deploy 1 task per EC2 instance that is part of the cluster.
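A sketch of the daemon variant on an EC2-backed cluster (again with placeholder names); note there is no desired count, since ECS derives it from the number of registered container instances:

```hcl
resource "aws_ecs_service" "api_daemon" {
  name                = "api"
  cluster             = aws_ecs_cluster.main.id
  task_definition     = aws_ecs_task_definition.api.arn
  launch_type         = "EC2"
  scheduling_strategy = "DAEMON"   # exactly 1 task per container instance in the cluster
}
```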
When creating the service you will find the field Number of tasks, which is exactly how many copies of the task you want running. If you enter 1 it will launch only 1, and if you enter 2 it will launch 2. I hope that helps.

AWS ECS: no container instance met all of its requirements

We have an ECS cluster with 3 EC2 instances. In this cluster we have a bunch of services running, all separate apps with 1 task.
Frequently when I try to run a new service, ECS tries to run the task on an EC2 instance without enough memory/CPU, while a different instance with more than enough is available. In fact, there are now 2 instances with 5 tasks each, and 1 instance with only 1.
What could be the reason for this weird division of tasks? I've tried every possible task placement strategy (see the sketch after the error message below), but that doesn't seem to make a difference.
Most recent error message:
service [service name] was unable to place a task because no container instance met all of its requirements. The closest matching container-instance [instance-id] has insufficient memory available.
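For context, this is roughly what placement strategies look like in Terraform on an EC2-backed service (resource names are hypothetical); spread distributes tasks across zones or instances, while binpack deliberately packs them onto as few instances as possible:

```hcl
resource "aws_ecs_service" "app" {
  name            = "app"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  launch_type     = "EC2"
  desired_count   = 1

  # Evaluated in order: spread across AZs first, then across instances.
  ordered_placement_strategy {
    type  = "spread"
    field = "attribute:ecs.availability-zone"
  }
  ordered_placement_strategy {
    type  = "spread"
    field = "instanceId"
  }
}
```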

CloudFormation, CodeDeploy, ELB & Auto-Scaling Group

I am trying to build a stack with an ELB, an Auto-Scaling Group and a Pipeline (with CodeBuild and CodeDeploy).
I can't understand how it is supposed to work:
the Auto-Scaling group starts two instances and waits X minutes before starting to check the instances' state
the CodeDeploy application deployment group is waiting for the Auto-Scaling group to be created and ready
the pipeline takes about 10 minutes to start deploying the application
My issue is that when I create the stack, it looks like there is a loop: the AG requires an application from CodeDeploy, and CodeDeploy requires a stabilized AG. To be clear, by the time the application is ready to deploy, my Auto-Scaling group is already terminating instances and starting new ones, so CodeDeploy is trying to deploy to instances that are already terminated or terminating.
I don't really want to configure HealthCheckGracePeriod and PauseTime to be ~10-15 minutes... it is way too long.
Are there any best practices for CloudFormation + ELB + AG + CodeDeploy via a Pipeline?
What should be the steps to achieve that?
Thank you!
This stopping/starting of the instances is most probably linked to the deployment type: in-place vs. blue/green.
I have tried both in my setup, and I will try to summarize how they work.
Let's say that for this example, you have an Autoscaling group which at the time of deploying the application has 2 running instances and the deployment configuration is OneAtATime. Traffic is controlled by the Elastic Load Balancer. Then:
In-place deployment:
CodeDeploy gets notified of a new revision available.
It tells the ELB to stop directing traffic to the 1st instance.
Once traffic to that instance is stopped, it starts the deployment process: stop the application, download the bundle, etc.
If the deployment is successful (the ValidateService hook returned 0), it tells the ELB to resume traffic to that instance.
At this point, 1 instance is running the old code and 1 is running the new code.
Right after that, the ELB stops traffic to the 2nd instance and CodeDeploy repeats the deployment process there.
Important note:
With an ELB enabled, the time it takes to block traffic to an instance before deployment, and the time it takes to allow traffic after it, depend directly on your health check: time = healthy threshold * interval. With a healthy threshold of 5 and a 30-second interval, that is 150 seconds on each side of the deployment.
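A hedged Terraform sketch of what such an in-place deployment group can look like (the application, service role, AG, and target group names are placeholders):

```hcl
resource "aws_codedeploy_deployment_group" "in_place" {
  app_name               = aws_codedeploy_app.app.name
  deployment_group_name  = "in-place"
  service_role_arn       = var.codedeploy_role_arn
  autoscaling_groups     = [aws_autoscaling_group.app.name]
  deployment_config_name = "CodeDeployDefault.OneAtATime"

  # Traffic is blocked/restored through the load balancer around each instance.
  deployment_style {
    deployment_type   = "IN_PLACE"
    deployment_option = "WITH_TRAFFIC_CONTROL"
  }

  load_balancer_info {
    target_group_info {
      name = aws_lb_target_group.app.name
    }
  }
}
```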
Blue/green deployment:
CodeDeploy gets notified of a new revision available.
It copies your Auto-Scaling group: the same group configuration (including scaling policies, scheduled actions, etc.) and the same number of instances (using the same AMI as your original AG) that were there at the start of the deployment - in our case 2.
At this point, there is no traffic going to the new AG.
CodeDeploy performs all the usual installation steps on one machine.
If successful, it deploys to the second machine too.
It directs traffic from the instances in your old AG to the new AG.
Once traffic is completely re-routed, it deletes the old AG and terminates all its instances (after a period specified in Deployment Settings - this option is only available if you select Blue/Green)
Now ELB is serving only the new AG.
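And a hedged Terraform sketch of the blue/green variant with the same placeholder names; the termination wait corresponds to the "period specified in Deployment Settings" mentioned above, and AllAtOnce matches the parallel-deployment recommendation below:

```hcl
resource "aws_codedeploy_deployment_group" "blue_green" {
  app_name               = aws_codedeploy_app.app.name
  deployment_group_name  = "blue-green"
  service_role_arn       = var.codedeploy_role_arn
  autoscaling_groups     = [aws_autoscaling_group.app.name]
  deployment_config_name = "CodeDeployDefault.AllAtOnce"

  deployment_style {
    deployment_type   = "BLUE_GREEN"
    deployment_option = "WITH_TRAFFIC_CONTROL"
  }

  load_balancer_info {
    target_group_info {
      name = aws_lb_target_group.app.name
    }
  }

  blue_green_deployment_config {
    green_fleet_provisioning_option {
      action = "COPY_AUTO_SCALING_GROUP"   # clone the original AG for the green fleet
    }
    deployment_ready_option {
      action_on_timeout = "CONTINUE_DEPLOYMENT"
    }
    terminate_blue_instances_on_deployment_success {
      action                           = "TERMINATE"
      termination_wait_time_in_minutes = 5   # how long the old AG lingers before deletion
    }
  }
}
```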
From experience:
Blue/green deployment is a bit slower, since you need to wait for the machines to boot up, but you get a much safer and more fail-proof deployment.
In general I would stick with blue/green, with the load balancer enabled and the deployment configuration AllAtOnce (if it fails, customers won't be affected since the instances won't be receiving traffic, and it will be twice as fast since you deploy in parallel rather than sequentially).
If your health checks and ValidateService hooks are thorough enough, you can probably delete the original AG with minimal waiting time (5 minutes at the time of writing this post).