AWS CodeDeploy BlockTraffic/AllowTraffic durations

I've been using AWS CodeDeploy to push our applications live, but it always takes ages on the BlockTraffic and AllowTraffic steps. Currently I have an Application Load Balancer (ALB) with three EC2 nodes initially (behind an Auto Scaling group). So if I do a CodeDeploy OneAtATime, the whole process takes up to 25 minutes.
The load balancer I'm using had connection draining set to 300 seconds, and I thought that was the reason for the delay. However, I disabled connection draining and got the same results. I then enabled connection draining with a timeout of 5 seconds and still got the same results.
Further, I found out that CodeDeploy depends on the ALB health check settings. According to the AWS documentation:
After an instance is bound to the ALB, CodeDeploy waits for the
status of the instance to be healthy ("inService") behind the load
balancer. This health check is done by ALB and depends on the health
check configuration.
So I tried setting low timeouts and thresholds in the health check settings. Even those changes didn't reduce the deployment time much.
Can someone direct me to a proper solution to speed up the process?

The issue is the deregistration of instances from the target group. You want to change the deregistration_delay.timeout_seconds attribute on the target group - by default it's 300 seconds, which is 5 minutes. The attribute is described in the Elastic Load Balancing target group documentation.
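For reference, a minimal boto3 sketch of that change; the target group ARN below is a placeholder for your own:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN - replace with your own target group.
TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "targetgroup/my-tg/0123456789abcdef"
)

# Lower the deregistration (connection draining) delay from the 300-second
# default to 30 seconds so instances leave the target group much sooner.
elbv2.modify_target_group_attributes(
    TargetGroupArn=TARGET_GROUP_ARN,
    Attributes=[
        {"Key": "deregistration_delay.timeout_seconds", "Value": "30"},
    ],
)
```

Bear in mind that a very short delay can cut off in-flight requests, so pick a value your application can tolerate.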

Related

Why does deploying a Fargate service behind a load balancer take more time than deploying it on its own?

I'm making some basic architecture with some backend and frontend docker containers.
I started by deploying a backend container: I defined a task with AWS Fargate and configured its own service. Deployments took about 2 minutes, but now that I have a Network Load Balancer with a target group pointing at my container, a deploy takes about 6-10 minutes. Is this normal?
The deploy time isn't a big problem, but while I'm studying this technology it feels a little slow (compared to a Kubernetes cluster, for example). My plan is to build a bigger architecture with backend, frontend, a database and auto scaling, just for learning.
Without your configuration (CloudFormation template) or some screenshots it's hard to say exactly what is causing this, but I can point you to some areas that commonly cause longer deployments.
You possibly have the default health check interval of 30 seconds, which means the load balancer waits 30 seconds before carrying out another check on the TCP port. You can reduce this to 10 seconds to speed up health checks (a sketch of this change appears at the end of this answer).
You possibly also have the default healthy threshold of 3. Coupled with the default check interval, this adds around 90 seconds before a container stabilises and returns a healthy status.
You may be deploying a single container without cross-zone load balancing enabled and with the round-robin routing algorithm. That setup can fail a health check, at which point the threshold count starts again and you need 3 passes in a row to return a healthy status.
Lastly, because Fargate has no image caching, the image pull takes a bit longer than on ECS backed by EC2 instances, which also adds to the deployment time.
With all of the above configured correctly you can usually win back a few minutes on a Network Load Balancer target group deployment, but they are generally a bit slower than Application Load Balancer deployments.
If your app does not require a Network Load Balancer, it's recommended to use an Application Load Balancer instead. An Application Load Balancer's check goes much deeper: it can determine availability based not only on a successful HTTP GET of a particular page but also on verifying that the content is what was expected.
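If you want to tighten the interval and thresholds mentioned above, here is a hedged boto3 sketch; the ARN is a placeholder and the permitted values depend on your load balancer type and health check protocol:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN - replace with the target group behind your NLB.
TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "targetgroup/backend-tg/0123456789abcdef"
)

# Check every 10 seconds instead of the 30-second default, and require
# 2 consecutive results instead of 3 to flip a target's health state.
elbv2.modify_target_group(
    TargetGroupArn=TARGET_GROUP_ARN,
    HealthCheckIntervalSeconds=10,
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=2,
)
```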

Fargate deployment restarting multiple times before it comes online

I have an ECS service deployed on Fargate.
It is attached to a Network Load Balancer. Rolling updates were working fine, but suddenly I see the issue below.
When I update the service with a new task definition, Fargate starts the deployment and tries to start a new container. Since the service is attached to the NLB, the new task registers itself with the NLB target group.
But the NLB target group's health check fails, so Fargate kills the failed task and starts a new one. This repeats multiple times (the number actually varies; today it took 7 hours for the rolling update to finish).
There are no changes to the infrastructure other than the deployment. The security group allows traffic within the VPC, and the NLB and the ECS service are deployed into the same VPC and subnet.
The Fargate health check fails for the task with the same Docker image N times, but after that it starts working.
The target group healthy/unhealthy threshold is 3, the protocol is TCP, the port is traffic-port and the interval is 30 seconds. In the microservice startup log I see this:
Started myapp in 44.174 seconds (JVM running for 45.734)
When the task comes up, I tried opening a security group rule for the VPN and accessing the task IP directly; I can reach the microservice directly via the task IP.
So why is the NLB health check failing?
I had the exact same issue.
I simulated it with different images (Go, Python) because I suspected CPU/memory utilization overhead, which turned out to be false.
A mitigation is to change the Fargate deployment parameter Minimum healthy percent to 50% (it was 100% before, which seemed to cause the issue).
After the change the failures became rare, but they still occurred.
The real root cause is still unknown; it seems to be related to the NLB configuration with Fargate.
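For what it's worth, a minimal boto3 sketch of that mitigation; the cluster and service names are placeholders:

```python
import boto3

ecs = boto3.client("ecs")

# Placeholder names - substitute your own cluster and service.
ecs.update_service(
    cluster="my-cluster",
    service="my-service",
    deploymentConfiguration={
        # Let ECS drop to half the desired task count during a rolling update
        # instead of requiring 100% healthy tasks at all times.
        "minimumHealthyPercent": 50,
        "maximumPercent": 200,
    },
)
```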

How does ALB distribute requests to Fargate service during rolling update deployment?

I deploy a Fargate service in a cluster and use the rolling update deployment type. I configured an ALB in front of my service, and it performs a health check as well. During the upgrade I can see that my current task is marked as INACTIVE and the new task is deployed. Both tasks are in the running state.
I understand that the ALB is doing a health check on the newly deployed tasks, but it keeps two tasks running for 5 minutes.
I have a few questions about this deployment period of time.
Does the ALB distribute user requests to my new tasks before they pass the health check?
If the answer to the first question is no, does the ALB distribute user requests to the new tasks after they pass the health check but before the old tasks are shut down?
If the answer to the second question is yes, then there will be two versions of tasks running inside my service and serving user requests for 5 minutes. Is this true? How can I make sure it only sends requests to one version at a time?
I don't want to change the deployment method to BLUE/GREEN. I want to keep the rolling update at the moment.
The ALB will not send traffic to a task that is not yet passing health checks, so no to #1. The ALB will send traffic to both old and new tasks while deploying, so yes to #2. As soon as a replacement task is available, the ALB starts to drain the task it is replacing; the default time for that is 5 minutes, and during that time the draining task will not receive traffic, so sort of no to #3. The "sort of" part is that there will be some period during which versions A and B of your service are both deployed. How long that lasts depends on the number of tasks and how long it takes for them to start receiving traffic.
The only way I can think of to send all traffic to one version and then hard cut over to the other is to create a completely new target group for each deployment, keeping the old one active. Then, once the new target group is up and healthy, switch to it. You'd have to change the listener rules in the ALB as you do that (see the sketch below).
By the way, what is happening now is what I would call a rolling deployment.
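As an illustration of the target-group cutover described above, here is a hedged boto3 sketch; the listener and target group ARNs are placeholders, and it assumes the listener's default action simply forwards to one target group:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARNs - replace with your ALB listener and the new target group.
LISTENER_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "listener/app/my-alb/0123456789abcdef/fedcba9876543210"
)
NEW_TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "targetgroup/v2-tg/0123456789abcdef"
)

# Repoint the listener's default action at the new target group in one call,
# so traffic cuts over at once rather than mixing old and new tasks.
elbv2.modify_listener(
    ListenerArn=LISTENER_ARN,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": NEW_TARGET_GROUP_ARN}],
)
```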

AWS CodeDeploy: stuck on install step

I'm running through this tutorial to create a deployment pipeline with my custom .NET-based Docker image.
But when I start a deployment, it gets stuck on the Install phase, so I have to stop it manually.
After that I get a couple of running tasks with different task definition revisions (note :1 and :4, because I've tried to run the deployment 4 times by now).
They also keep cycling through the states RUNNING -> PROVISIONING -> PENDING, and the list of stopped tasks grows.
Q:
So, how do I hunt down the issue with CodeDeploy? Why is it running forever?
UPDATE:
It is connected to health checks.
UPDATE:
I'm getting this:
(service dataapi-dev-service, taskSet ecs-svc/9223370487815385540) (port 80) is unhealthy in target-group dataapi-dev-tg1 due to (reason Health checks failed with these codes: [404]).
I don't quite understand why it's failing for the newly created container, because the original one passes the health check.
While the ECS task is running, the ELB (Elastic Load Balancer) constantly health-checks the container, as configured in the target group, to check whether the container is still responding.
From your debug message, the container (api) responded to the health check path with a 404.
I suggest you configure the health check path in the target group dataapi-dev-tg1.
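For example, a minimal boto3 sketch that points the health check at a path the container actually serves; the /health path and the ARN are assumptions, so use whatever endpoint your API really exposes:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN for dataapi-dev-tg1 - replace with the real one.
TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "targetgroup/dataapi-dev-tg1/0123456789abcdef"
)

# Point the health check at an endpoint that returns 200 instead of 404.
elbv2.modify_target_group(
    TargetGroupArn=TARGET_GROUP_ARN,
    HealthCheckPath="/health",  # assumed endpoint - use your API's real one
    Matcher={"HttpCode": "200"},
)
```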
For those who are still hitting this issue: in my case the ECS cluster had no outbound connectivity.
Possible solutions to this problem:
make sure the security groups you use with your VPC allow outbound traffic
make sure that the route table you use with the VPC has subnet associations with the subnets you use with your load balancer (examine the route tables)
I was able to figure this out because I enabled CloudWatch during ECS cluster creation and got a CannotPullContainerError. For more information on solving this problem, look into the Cannot Pull Container Image Error documentation.
Make sure your Internet Gateway is attached to your subnets through the route table (routes) if your load balancer is internet-facing.
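If it helps, here is a hedged boto3 sketch for checking whether a subnet's route table has a default route to an internet gateway; the subnet ID is a placeholder, and note that a subnet with no explicit association uses the VPC's main route table, which this filter won't return:

```python
import boto3

ec2 = boto3.client("ec2")

SUBNET_ID = "subnet-0123456789abcdef0"  # placeholder - one of your load balancer subnets

# Find the route table explicitly associated with the subnet and look for a
# 0.0.0.0/0 route that points at an internet gateway (igw-...) or NAT gateway.
tables = ec2.describe_route_tables(
    Filters=[{"Name": "association.subnet-id", "Values": [SUBNET_ID]}]
)["RouteTables"]

for table in tables:
    for route in table.get("Routes", []):
        if route.get("DestinationCidrBlock") == "0.0.0.0/0":
            print(table["RouteTableId"], "->",
                  route.get("GatewayId") or route.get("NatGatewayId"))
```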
The error is due to a health check that detected an unhealthy target.
Make sure to check your configuration in the target group settings.
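A quick way to see why the load balancer considers a target unhealthy, sketched with boto3 and a placeholder ARN:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN - replace with your target group.
TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "targetgroup/my-tg/0123456789abcdef"
)

# Print each registered target's state along with the reason and description
# the load balancer reports for unhealthy results.
response = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
for desc in response["TargetHealthDescriptions"]:
    health = desc["TargetHealth"]
    print(desc["Target"]["Id"], health["State"],
          health.get("Reason"), health.get("Description"))
```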

AWS Health Check Restart API

I have an AWS load balancer with 2 EC2 instances serving an API written in Python.
If 10K requests come in at the same time and the AWS health check also comes in, the health check fails, and there are 502/504 gateway errors because the instances restart due to the failed health check.
I checked the instances' CPU usage, which maxed out at 30%, and memory, which maxed out at 25%.
What's the best option to fix this?
A few things to consider here:
Keep the health check API fairly light, but ensure that the health check API/URL indeed returns correct responses based on the health of the app.
You can configure the health check to mark the instance as failed only after X failed checks. You can tune this parameter and the Health check frequency to match your needs.
You can stop EC2 restarts caused by failed health checks by setting your Auto Scaling group's health check type to EC2. This prevents instances from being terminated due to a failed ELB health check (a sketch of this change follows below).
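A minimal boto3 sketch of that last suggestion, with a placeholder Auto Scaling group name:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Placeholder name - replace with your Auto Scaling group.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="api-asg",
    # Use EC2 status checks rather than ELB health checks to decide instance
    # health, so a failed ELB health check no longer triggers replacement.
    HealthCheckType="EC2",
)
```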