AWS CodeDeploy: stuck on install step

I'm running through this tutorial to create a deployment pipeline with my custom .NET-based Docker image.
But when I start a deployment, it gets stuck on the Install phase, so I have to stop it manually:
After that I get a couple of running tasks with different task definitions (note :1 and :4, because I've tried to run the deployment 4 times by now):
They also cycle through the states RUNNING -> PROVISIONING -> PENDING all the time, and the list of stopped tasks grows:
Q:
How do I hunt down the issue with CodeDeploy? Why does it run forever?
UPDATE:
It turns out to be related to health checks.
UPDATE:
I'm getting this:
(service dataapi-dev-service, taskSet ecs-svc/9223370487815385540) (port 80) is unhealthy in target-group dataapi-dev-tg1 due to (reason Health checks failed with these codes: [404]).
I don't quite understand why it's failing for the newly created container, since the original one passes the health check.
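(For anyone hunting down similar behaviour: the quickest clue I know of is each stopped task's stoppedReason field. A minimal boto3 sketch; the cluster name is a placeholder, the service name comes from the error message above:

    import boto3

    ecs = boto3.client("ecs")

    # "dataapi-dev-cluster" is an assumed cluster name; replace with your own.
    stopped = ecs.list_tasks(
        cluster="dataapi-dev-cluster",
        serviceName="dataapi-dev-service",
        desiredStatus="STOPPED",
    )
    if stopped["taskArns"]:
        tasks = ecs.describe_tasks(cluster="dataapi-dev-cluster", tasks=stopped["taskArns"])
        for task in tasks["tasks"]:
            # stoppedReason explains why ECS replaced the task, e.g. failed ELB health checks
            print(task["taskArn"], "->", task.get("stoppedReason"))

In this case the stopped reason pointed at the target group health check, which matches the update above.)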

While the ECS task is running, the ELB (Elastic Load Balancer) continuously health-checks the container according to the target group configuration, to verify that the container is still responding.
From your debug message, the container (api) responded to the health check path with a 404.
I suggest you fix the health check path configured in the target group dataapi-dev-tg1.
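If it helps, you can inspect and change the path programmatically too. A minimal boto3 sketch; the target group name comes from the error above, but the /health path is just an assumption, so use whatever path your app actually returns 200 on:

    import boto3

    elbv2 = boto3.client("elbv2")

    # Look up the target group named in the error message.
    tg = elbv2.describe_target_groups(Names=["dataapi-dev-tg1"])["TargetGroups"][0]

    # "/health" is an assumed path; point this at an endpoint your container serves.
    elbv2.modify_target_group(
        TargetGroupArn=tg["TargetGroupArn"],
        HealthCheckPath="/health",
        Matcher={"HttpCode": "200"},
    )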

For those who are still hitting this issue: in my case the ECS cluster had no outbound connectivity.
Possible solutions to this problem:
make sure the security groups you use with your VPC allow outbound traffic
make sure the route table you use with the VPC has subnet associations with the subnets used by your load balancer (examine the route tables)
I was able to figure this out because I enabled CloudWatch logging during ECS cluster creation and saw CannotPullContainerError. For more information on solving this problem, see Cannot Pull Container Image Error.
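A rough boto3 sketch of those two checks; the sg-/vpc- IDs are placeholders for whatever your tasks actually use:

    import boto3

    ec2 = boto3.client("ec2")

    # Placeholder security group ID; an empty egress list means no outbound traffic.
    sg = ec2.describe_security_groups(GroupIds=["sg-0123456789abcdef0"])["SecurityGroups"][0]
    print("Egress rules:", sg["IpPermissionsEgress"])

    # Placeholder VPC ID; look for a route to an internet gateway or NAT gateway.
    tables = ec2.describe_route_tables(
        Filters=[{"Name": "vpc-id", "Values": ["vpc-0123456789abcdef0"]}]
    )
    for rt in tables["RouteTables"]:
        for route in rt["Routes"]:
            print(rt["RouteTableId"],
                  route.get("DestinationCidrBlock"),
                  route.get("GatewayId") or route.get("NatGatewayId"))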

Make sure your Internet Gateway is reachable from your subnets through the route table (routes) if your load balancer is internet-facing.

The error is caused by a health check that detected an unhealthy target.
Make sure to check your configuration in the target group settings.
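For example, you can dump the current health check settings for review with boto3; the target group name here is a placeholder:

    import boto3

    elbv2 = boto3.client("elbv2")

    # "my-target-group" is an assumed name; use your own.
    tg = elbv2.describe_target_groups(Names=["my-target-group"])["TargetGroups"][0]
    for key in ("HealthCheckProtocol", "HealthCheckPort", "HealthCheckPath",
                "HealthCheckIntervalSeconds", "HealthyThresholdCount",
                "UnhealthyThresholdCount"):
        print(key, "=", tg.get(key))
    print("Matcher =", tg.get("Matcher"))  # expected HTTP success codes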

Related

The ELB could not be updated due to the following error: Primary taskset target group must be behind listener arn:aws:elasticloadbalancing

I'm trying to learn about ECS and Blue/Green deployments with CodePipeline.
I have been following several tutorials but I'm stuck with this error:
The ELB could not be updated due to the following error: Primary taskset target group must be behind listener arn:aws:elasticloadbalancing:us-east-1:XXX:listener/app/ecs-t-XXX/XXX/XXX.
This is what I have:
I created 2 listeners: one for PROD (8080) and the other for TEST (8088)
I created 2 Target Groups: one for each listener.
When I go to my Load Balancer and check the Listeners I can see them there.
I have 2 services: the original (Service1) and the new one (Service2).
Both services have the same configuration (except for the target group under load balancer: Service1 has the target-1 and Service2 the target-2).
Service2 has CODE_DEPLOY as DeploymentController and REPLICA as SchedulingStrategy (Service1 doesn't)
In CodePipeline, when I reach the deploy action, it fails with the previous message.
The Load Balancer configuration seems fine:
Listener 8080 forwards to target1
Listener 8088 forwards to target2
I checked my CodeDeploy application and deployment group, and everything seems fine. Under Load Balancing, I have target-1, which points to the PROD listener, and target-2, which points to the TEST listener.
As for the environment, the ECS service is Service2 (i.e. the new one).
Permissions are fine.
So, what am I not seeing?
I searched for this error but couldn't find an answer that worked for me.
The closest one was about target groups not being attached to the Load Balancer. But in my case they are attached, and each listener forwards to its respective target group.
I'd appreciate help. I'm out of ideas.
This might be the issue:
I created 2 Target Groups: one for each listener. When I go to my Load Balancer and check the Listeners I can see them there.
If you've pointed each listener at its own target group, this issue will occur. Try pointing both listeners to the single target group under which the application is currently running.
More troubleshooting steps here: https://aws.amazon.com/premiumsupport/knowledge-center/ecs-blue-green-deployment/
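A sketch of that fix in boto3, assuming placeholder listener and target group ARNs (modify_listener repoints a listener's default forward action):

    import boto3

    elbv2 = boto3.client("elbv2")

    # Placeholder ARNs: substitute your 8080 and 8088 listeners and the
    # target group that the currently running task set is registered with.
    prod_listener_arn = "arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/my-alb/XXX/XXX"
    test_listener_arn = "arn:aws:elasticloadbalancing:us-east-1:123456789012:listener/app/my-alb/XXX/YYY"
    live_tg_arn = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/target-1/XXX"

    for listener_arn in (prod_listener_arn, test_listener_arn):
        elbv2.modify_listener(
            ListenerArn=listener_arn,
            DefaultActions=[{"Type": "forward", "TargetGroupArn": live_tg_arn}],
        )

During a blue/green deployment, CodeDeploy then moves the listeners to the replacement target group itself.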

Fargate deployment restarting multiple times before it comes online

I have an ECS service deployed into Fargate.
It is attached to a Network Load Balancer. Rolling updates were working fine, but suddenly I see the issue below.
When I update the service with a new task definition, Fargate starts the deployment and tries to start a new container. Since the service is attached to the NLB, the new task registers itself with the NLB target group.
But the NLB target group's health check fails, so Fargate kills the failed task and starts a new one. This repeats multiple times (the number actually varies; today it took 7 hours for the rolling update to finish).
There are no changes to the infrastructure after the deployment. The security group allows traffic within the VPC. The NLB and the ECS service are deployed into the same VPC and subnet.
The health check fails N times for tasks running the same Docker image, and after that it starts working.
The target group's healthy/unhealthy threshold is 3, the protocol is TCP, the port is traffic-port, and the interval is 30 seconds. In the microservice startup log I see this:
Started myapp in 44.174 seconds (JVM running for 45.734)
When the task comes up, I opened a security group rule for the VPN and tried accessing the task IP directly. I can reach the microservice directly via the task IP.
So why is the NLB health check failing?
I had the exact same issue.
I simulated it with different images (Go, Python) because I suspected CPU/memory utilization overhead, which turned out to be false.
A mitigation is changing the service's Minimum healthy percent deployment parameter to 50% (it was previously 100%, which seemed to cause the issue).
After the change the failures became rare, but they would still occur.
The real root cause is still unknown; it seems to be related to the NLB configuration with Fargate.
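For reference, the mitigation above is a single update_service call; the cluster and service names here are placeholders:

    import boto3

    ecs = boto3.client("ecs")

    # Placeholder names. minimumHealthyPercent=50 lets the scheduler take one
    # of two tasks out of service during a rolling update instead of insisting
    # that the full desired count stays healthy throughout.
    ecs.update_service(
        cluster="my-cluster",
        service="my-service",
        deploymentConfiguration={
            "minimumHealthyPercent": 50,
            "maximumPercent": 200,
        },
    )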

AWS Codedeploy BlockTraffic/AllowTraffic durations

I've been using AWS CodeDeploy to push our applications live, but it always takes ages on the BlockTraffic and AllowTraffic steps. Currently, I have an Application Load Balancer (ALB) with three EC2 nodes initially (behind an Auto Scaling group), so if I do a CodeDeploy OneAtATime deployment, the whole process takes up to 25 minutes.
The load balancer had connection draining set to 300s, and I thought that was the reason for the drag. However, I disabled connection draining and got the same results; I then re-enabled it with a 5-second timeout and still got the same results.
Further, I found out that CodeDeploy depends on the ALB health check settings. According to the AWS documentation:
After an instance is bound to the ALB, CodeDeploy waits for the status of the instance to be healthy ("inService") behind the load balancer. This health check is done by ALB and depends on the health check configuration.
So I tried setting low timeouts and thresholds in the health check settings, but even those changes didn't reduce the deployment time much.
Can someone point me to a proper solution to speed up the process?
The issue is the deregistration of instances from the AWS target group. You want to change the deregistration_delay.timeout_seconds attribute on the target group; by default it's 300s, which is 5 minutes. The docs can be found here.
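A short boto3 sketch of that change, with a placeholder target group name (30 seconds is just an example value; pick one long enough for in-flight requests to drain):

    import boto3

    elbv2 = boto3.client("elbv2")

    # "my-target-group" is an assumed name; use your own.
    tg = elbv2.describe_target_groups(Names=["my-target-group"])["TargetGroups"][0]
    elbv2.modify_target_group_attributes(
        TargetGroupArn=tg["TargetGroupArn"],
        Attributes=[
            # Default is 300 seconds; lowering it shortens BlockTraffic.
            {"Key": "deregistration_delay.timeout_seconds", "Value": "30"},
        ],
    )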

AWS ECS Fargate ALB Error (Request Timed Out)

I have set up a Docker container running a small Django application on port 5566. The Docker image is uploaded to ECR and later used by Fargate container(s).
I have set up an ECS cluster with a VPC.
After creating the Task Definition and Service, the Service starts up 2 tasks (as it is supposed to):
Here's the Service's network access (with a health check grace period of 300s):
I also set up an Application Load Balancer (with DNS) with a target group for the service, but the health checks seem to be failing:
Here's the health check configuration:
Because the health checks are failing, the tasks are terminated and new ones are started roughly every 5 minutes.
Here's the container's port mapping:
Since one cannot access a Fargate container (via SSH, for example) and the logs are empty, how should I troubleshoot the issue?
I have tried to follow every step in the Troubleshoot Your Application Load Balancer.
Feel free to ask additional information.
Can you confirm that your application is actually listening on port 5566 inside Docker?
You can check the logs in CloudWatch; you'll find the link under cluster -> service -> tasks -> your task.
Can you post your ALB configuration and your target group port?
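One more thing that may help while you can't reach the container: describe_target_health reports why each target is failing. A sketch with a placeholder target group name:

    import boto3

    elbv2 = boto3.client("elbv2")

    # "my-fargate-tg" is an assumed name for the service's target group.
    tg = elbv2.describe_target_groups(Names=["my-fargate-tg"])["TargetGroups"][0]
    health = elbv2.describe_target_health(TargetGroupArn=tg["TargetGroupArn"])
    for desc in health["TargetHealthDescriptions"]:
        target, state = desc["Target"], desc["TargetHealth"]
        # Reason/Description say whether the check timed out, was refused,
        # or returned an unexpected status code.
        print(target["Id"], target["Port"], state["State"],
              state.get("Reason"), state.get("Description"))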

AWS CodeDeploy deployment failed at event BlockTraffic

I am trying to set up auto-deployment from GitHub to AWS, using EC2 behind an ELB.
After following the Tutorial: Use AWS CodeDeploy to Deploy an Application from GitHub, my deployment fails at the BlockTraffic event after trying for an hour (1h 2min last time), with error code ScriptFailed. I'm not sure how to troubleshoot the issue or where to look.
The ELB target group's target health status is: healthy
Health check configuration:
Healthy threshold: 2
Unhealthy threshold: 2
Timeout: 5 seconds
Interval: 10 seconds
Success codes: 200
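(Side note on those numbers: assuming standard target group semantics, a fresh target must pass the healthy threshold of consecutive checks, one per interval, so these settings should mark a target healthy in roughly 20 seconds; an hour-long BlockTraffic therefore isn't explained by the health check alone:

    healthy_threshold = 2    # consecutive successful checks required
    interval_seconds = 10    # seconds between checks

    # Rough lower bound on time for a newly registered target to go healthy:
    print(healthy_threshold * interval_seconds, "seconds")  # -> 20 seconds
)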
Don't enable the load balancer in the CodeDeploy deployment group for the pipeline, and you will get rid of the BlockTraffic and AllowTraffic steps.
Make sure your CodeDeploy role has sufficient access to register and deregister instances when they are behind an ELB.
The permissions below may be required:
"elasticloadbalancing:DescribeLoadBalancers",
"elasticloadbalancing:DescribeInstanceHealth",
"elasticloadbalancing:RegisterInstancesWithLoadBalancer",
"elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
"elasticloadbalancing:DescribeTargetGroups",
"elasticloadbalancing:DescribeTargetHealth",
"elasticloadbalancing:RegisterTargets",
"elasticloadbalancing:DeregisterTargets"
There is an AWSCodeDeployRole managed policy that makes it very easy to cover the permissions you need to use CodeDeploy.
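If you go that route, attaching the managed policy is one call; the role name below is a placeholder for your deployment group's service role:

    import boto3

    iam = boto3.client("iam")

    # "MyCodeDeployRole" is an assumed role name; use the service role that
    # your CodeDeploy deployment group is configured with.
    iam.attach_role_policy(
        RoleName="MyCodeDeployRole",
        PolicyArn="arn:aws:iam::aws:policy/service-role/AWSCodeDeployRole",
    )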
I had the same issue, and I realised that in the deployment group I hadn't tagged the instances of the target group on which the health checks run, so the deployment group couldn't tell which target group it had to direct traffic to.
The issue I ran into is that, for an ELB, if the port was not the expected port, CodeDeploy's BlockTraffic step would not know how to deregister the instance from the target group.
In my case, my HTTPS ELB communicated via HTTP with port 3000 on each of my target groups. I found the specific root cause by using this guide: https://aws.amazon.com/premiumsupport/knowledge-center/codedeploy-failed-ec2-deployment/
Its output identified that I was using port 3000 instead of the expected port 80.
During BlockTraffic, the CodeDeploy service invokes the load balancer to deregister the instance from the target group before it starts installing the application revision.
The DeregisterTargets API call can be seen in the CloudTrail logs during the BlockTraffic lifecycle hook.
Currently, CodeDeploy does not support the case where the target group is configured with a different port than the port used to register the instance: DeregisterTargets will not be able to deregister the instance if the port configured in the target group differs.
You need to make sure that both the target group and the instance are configured to use the same port.
BlockTraffic depends mainly on the deregistration delay on the target group (or connection draining on a Classic Load Balancer). To speed up this step, reduce the deregistration delay / connection draining to a reasonable value.
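To check for the port mismatch described above, you can compare the target group's configured port with the port each target was actually registered on; a sketch with a placeholder target group name:

    import boto3

    elbv2 = boto3.client("elbv2")

    # "my-target-group" is an assumed name; in the answer above the
    # mismatch was targets registered on 3000 vs. an expected port 80.
    tg = elbv2.describe_target_groups(Names=["my-target-group"])["TargetGroups"][0]
    configured_port = tg["Port"]

    health = elbv2.describe_target_health(TargetGroupArn=tg["TargetGroupArn"])
    for desc in health["TargetHealthDescriptions"]:
        registered_port = desc["Target"]["Port"]
        if registered_port != configured_port:
            print(f"Mismatch: target {desc['Target']['Id']} registered on "
                  f"{registered_port}, target group expects {configured_port}")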