Is there any way to overwrite the non-default listener rules of an ALB while doing a blue-green canary deployment using CodeDeploy with Fargate?

I am using ECS blue-green canary deployment with the canary configuration "shift 10% of traffic, wait 5 minutes, then shift to 100%". This configuration works fine, but I have a specific requirement:
Use case: in addition to the default rule on the port 80 listener, I have added two more listener rules (say rule-1 and rule-2). rule-1 and rule-2 each have multiple domains set up, so traffic from those domains can be routed to different target groups as we want at any particular deployment phase.
So before the deployment, the listener state looks like this (assuming tg1 is blue and tg2 is green):
default rule: 100% tg1
rule-1: 100% tg1
rule-2: 100% tg1
After AfterAllowTestTraffic, the listener state looks like this (CodeDeploy does this internally):
default rule: 90% tg1, 10% tg2
rule-1: 90% tg1, 10% tg2
rule-2: 90% tg1, 10% tg2
My requirement is to use the "AfterAllowTestTraffic" hook (or any other hook) to overwrite the rules as below, using a Lambda function that calls elbv2.modifyRule() from the AWS SDK:
default rule: 90% tg1, 10% tg2
rule-1: 100% tg1 (blue)
rule-2: 100% tg2 (green), which lets me route specific domains to the green tasks for the whole traffic wait time (2 days).
I tried this, but CodeDeploy waits for the "AfterAllowTestTraffic" hook to complete and then overwrites the listener rules again with 90%/10%.
I also tried other hooks such as "BeforeAllowTraffic", but CodeDeploy still overwrites the listener rules with 90%/10% afterwards.
Ideally the listeners should be shifted to 90%/10% first and only then should the Lambda attached to "BeforeAllowTraffic" run, but when you try to overwrite the listener rules from a hook during a canary deployment, the first traffic shift seems to happen after "BeforeAllowTraffic" (or after whichever hook modified the listener). Why?
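For reference, this is roughly the hook Lambda I am using (Node.js, AWS SDK for JavaScript v2); the rule and target group ARNs are placeholders for my actual resources:

const AWS = require('aws-sdk');

const elbv2 = new AWS.ELBv2();
const codedeploy = new AWS.CodeDeploy();

const RULE_2_ARN = 'arn:aws:elasticloadbalancing:...'; // placeholder: ARN of rule-2
const TG2_ARN = 'arn:aws:elasticloadbalancing:...';    // placeholder: ARN of tg2 (green)

exports.handler = async (event) => {
  let status = 'Succeeded';
  try {
    // Point rule-2 at 100% green so specific domains hit the new tasks
    // during the traffic wait time.
    await elbv2.modifyRule({
      RuleArn: RULE_2_ARN,
      Actions: [{
        Type: 'forward',
        ForwardConfig: {
          TargetGroups: [{ TargetGroupArn: TG2_ARN, Weight: 100 }],
        },
      }],
    }).promise();
  } catch (err) {
    console.error(err);
    status = 'Failed';
  }

  // Report the hook result back to CodeDeploy so the deployment continues.
  await codedeploy.putLifecycleEventHookExecutionStatus({
    deploymentId: event.DeploymentId,
    lifecycleEventHookExecutionId: event.LifecycleEventHookExecutionId,
    status,
  }).promise();

  return status;
};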

Related

Putting an ALB-NLB-ALB route for requests is giving 502s for the application

We had a primary ALB listening for all our apps, mapped through Route 53 records. Now we have a listener rule crunch, since an ALB doesn't support more than 100 rules. The proposed solution is to put an NLB under the primary ALB and then a secondary ALB under the NLB.
So flow will be:
Requests--->R53--->ALB1--->NLB--->ALB2--->Apps
ALB1 has a default rule which allows unmatched requests to pass through to NLB and then ultimately to ALB2 where new rules are evaluated.
Rule configuration at ALB1:
Default rule --ForwardTo--> NLB
Rule at the NLB:
TCP-443 listener --ForwardTo--> ALB2 target group with the Fargate application IPs
But we're seeing intermittent 502 responses on the primary ALB while testing. We are not seeing any 502s logged on ALB2, so possibly the NLB is terminating the connections; we have seen the target reset count climbing in the NLB metrics.
Also, nothing is getting logged in the application logs.
In another test we routed traffic directly to ALB2 through Route 53, and we didn't see any 502 responses there.
Any suggestions on how to go about debugging this?
I think I have the answer to my problem now, so I'm sharing it for a wider audience. The reason for the intermittent 502s was an inconsistent idle timeout value across the load balancers and the backend application.
The NLB idle timeout is fixed at 350 seconds and can't be changed, so we had inconsistent values across the chain: the first and last ALBs were set to 600 seconds.
Ideally the application should have the highest idle timeout, followed by the load balancers down the hierarchy.
Setting the first ALB to 300 seconds and the second ALB to 500 seconds solved the problem, and we haven't had a single 500-level response since this change.
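For reference, this is roughly how the ALB idle timeouts can be adjusted with the AWS SDK for JavaScript; the load balancer ARNs are placeholders:

const AWS = require('aws-sdk');
const elbv2 = new AWS.ELBv2();

// Keep the first ALB below the NLB's fixed 350s and the second ALB above it.
async function setIdleTimeout(loadBalancerArn, seconds) {
  await elbv2.modifyLoadBalancerAttributes({
    LoadBalancerArn: loadBalancerArn,
    Attributes: [{ Key: 'idle_timeout.timeout_seconds', Value: String(seconds) }],
  }).promise();
}

setIdleTimeout('arn:aws:elasticloadbalancing:...:loadbalancer/app/alb1/...', 300) // placeholder ARN
  .then(() => setIdleTimeout('arn:aws:elasticloadbalancing:...:loadbalancer/app/alb2/...', 500)) // placeholder ARN
  .catch(console.error);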

Fargate deployment restarting multiple times before it comes online

I have an ECS service deployed on Fargate.
It is attached to a Network Load Balancer. Rolling updates were working fine, but suddenly I'm seeing the issue below.
When I update the service with a new task definition, Fargate starts the deployment and tries to start a new container. Since the service is attached to the NLB, the new task registers itself with the NLB target group.
But the NLB target group's health check fails, so Fargate kills the failed task and starts a new one. This repeats multiple times (the number varies; today it took 7 hours for the rolling update to finish).
There are no changes to the infrastructure after the deployment. The security group allows traffic within the VPC, and the NLB and the ECS service are deployed in the same VPC and subnet.
The health check fails for tasks running the same Docker image N times, but after that it starts passing.
The target group healthy/unhealthy threshold is 3, the protocol is TCP, the port is traffic-port, and the interval is 30 seconds. In the microservice startup log I see this:
Started myapp in 44.174 seconds (JVM running for 45.734)
When the task comes up, I opened a security group rule for the VPN and tried accessing the task IP directly. I can reach the microservice directly on the task IP.
So why is the NLB health check failing?
I had the exact same issue.
I simulated it with different images (Go, Python) because I suspected CPU/memory utilization overhead, which turned out to be false.
A mitigation is to change the deployment parameter "minimum healthy percent" to 50% (it was 100% before, which seemed to cause the issue).
After the change the failures became rare, but they still occurred.
The real root cause is still unknown; it seems to be something related to the NLB configuration with Fargate.
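If it helps, this is a rough sketch of applying that mitigation with the AWS SDK for JavaScript; the cluster and service names are placeholders:

const AWS = require('aws-sdk');
const ecs = new AWS.ECS();

// Lower "minimum healthy percent" so the scheduler is not forced to keep
// 100% of the old tasks in service while new tasks register with the NLB.
ecs.updateService({
  cluster: 'my-cluster',   // placeholder
  service: 'my-service',   // placeholder
  deploymentConfiguration: {
    minimumHealthyPercent: 50,
    maximumPercent: 200,
  },
}).promise().catch(console.error);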

AWS Codedeploy BlockTraffic/AllowTraffic durations

I've been using AWS CodeDeploy to push our applications live, but it always takes ages in the BlockTraffic and AllowTraffic steps. Currently I have an Application Load Balancer (ALB) with three EC2 nodes initially (behind an Auto Scaling group). So if I do a CodeDeploy OneAtATime deployment, the whole process takes up to 25 minutes.
The load balancer had connection draining set to 300 seconds, which I thought was the reason for the drag. However, I disabled connection draining and got the same results, then re-enabled it with a 5-second timeout and still got the same results.
Further, I found out that CodeDeploy depends on the ALB health check settings. According to the AWS documentation:
After an instance is bound to the ALB, CodeDeploy waits for the status of the instance to be healthy ("inService") behind the load balancer. This health check is done by ALB and depends on the health check configuration.
So I tried setting low timeouts and thresholds in the health check settings. Even those changes didn't reduce the deployment time much.
Can someone direct me to a proper solution to speed up the process?
The issue is the deregistration of instances from the target group. You want to update the deregistration_delay.timeout_seconds attribute on the target group; by default it's 300 seconds, which is 5 minutes. See the target group attributes documentation for details.
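A quick way to do that with the AWS SDK for JavaScript (the target group ARN is a placeholder):

const AWS = require('aws-sdk');
const elbv2 = new AWS.ELBv2();

// Reduce the deregistration delay so BlockTraffic doesn't wait 300 seconds
// for each instance to drain.
elbv2.modifyTargetGroupAttributes({
  TargetGroupArn: 'arn:aws:elasticloadbalancing:...:targetgroup/my-tg/...', // placeholder
  Attributes: [{ Key: 'deregistration_delay.timeout_seconds', Value: '30' }],
}).promise().catch(console.error);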

AWS Application Load Balancer health checks fail

I have an ECS Fargate cluster with an ALB to route traffic to it. The Docker containers are listening on port 9000.
My containers are accessible over the ALB DNS name via HTTPS; that works. But they keep getting stopped/deregistered from the target group and restarted, only to be in an unhealthy state immediately after they are registered in the target group again.
The ALB has only one listener on 443.
The security groups are set up so that the sg-alb allows outbound traffic on port 9000 to sg-fargate and sg-fargate allows all inbound traffic on port 9000 from sg-alb.
The target group is also set up to use port 9000.
I'm not sure what the problem is, or how to debug it.
Everything is set up with the CDK; not sure if that's relevant.
As it turns out, this was not a problem with the security groups. It was just a coincidence that it worked at the time when I changed them.
It seems the containers aren't starting fast enough to accept connections from the ALB when it starts the health checks.
What helped:
changing healthCheckGracePeriod to two minutes
tweaking the health check parameters for the target group: interval, unhealthyThreshold, healthyThreshold
Also, in my application logs it looks like the service gets two health check requests at once. By default the unhealthy threshold is set to 2, so maybe the service was marked unhealthy after effectively a single health check.
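The changes above were made through the CDK, but the same settings can also be applied directly with the AWS SDK for JavaScript if that's easier to experiment with; a rough sketch (the names, ARN, and thresholds are placeholders/examples):

const AWS = require('aws-sdk');
const ecs = new AWS.ECS();
const elbv2 = new AWS.ELBv2();

async function relaxHealthChecks() {
  // Give containers time to boot before ALB health checks count against them.
  await ecs.updateService({
    cluster: 'my-cluster',               // placeholder
    service: 'my-service',               // placeholder
    healthCheckGracePeriodSeconds: 120,  // two minutes
  }).promise();

  // Loosen the target group health check parameters.
  await elbv2.modifyTargetGroup({
    TargetGroupArn: 'arn:aws:elasticloadbalancing:...:targetgroup/my-tg/...', // placeholder
    HealthCheckIntervalSeconds: 30,
    HealthyThresholdCount: 2,
    UnhealthyThresholdCount: 5,
  }).promise();
}

relaxHealthChecks().catch(console.error);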

AWS CodeDeploy deployment failed at event BlockTraffic

I am trying to set up auto-deployment from GitHub to AWS, using EC2 behind an ELB.
After following the tutorial "Use AWS CodeDeploy to Deploy an Application from GitHub", my deployment fails at the BlockTraffic event after trying for about an hour (1h 2min last time), with error code ScriptFailed. I'm not sure how to troubleshoot the issue or where to look.
The ELB target group target health status: healthy
Health Check configuration:
Healthy threshold: 2
Unhealthy threshold: 2
Timeout: 5
Interval: 10
Success codes: 200
Don't enable the load balancer on the CodeDeploy deployment group for the pipeline, and you will get rid of the BlockTraffic and AllowTraffic steps.
Make sure your CodeDeploy role has sufficient access to register and deregister instances if they are behind an ELB.
The permissions below may be required:
"elasticloadbalancing:DescribeLoadBalancers",
"elasticloadbalancing:DescribeInstanceHealth",
"elasticloadbalancing:RegisterInstancesWithLoadBalancer",
"elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
"elasticloadbalancing:DescribeTargetGroups",
"elasticloadbalancing:DescribeTargetHealth",
"elasticloadbalancing:RegisterTargets",
"elasticloadbalancing:DeregisterTargets"
There is a managed AWSCodeDeployRole policy that makes it very easy to cover the permissions you need to use CodeDeploy.
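If you prefer to grant just the actions listed above as an inline policy instead, a sketch with the AWS SDK for JavaScript (the role name is a placeholder):

const AWS = require('aws-sdk');
const iam = new AWS.IAM();

// Attach the ELB permissions listed above as an inline policy on the
// CodeDeploy service role.
iam.putRolePolicy({
  RoleName: 'my-codedeploy-service-role', // placeholder
  PolicyName: 'codedeploy-elb-access',
  PolicyDocument: JSON.stringify({
    Version: '2012-10-17',
    Statement: [{
      Effect: 'Allow',
      Action: [
        'elasticloadbalancing:DescribeLoadBalancers',
        'elasticloadbalancing:DescribeInstanceHealth',
        'elasticloadbalancing:RegisterInstancesWithLoadBalancer',
        'elasticloadbalancing:DeregisterInstancesFromLoadBalancer',
        'elasticloadbalancing:DescribeTargetGroups',
        'elasticloadbalancing:DescribeTargetHealth',
        'elasticloadbalancing:RegisterTargets',
        'elasticloadbalancing:DeregisterTargets',
      ],
      Resource: '*',
    }],
  }),
}).promise().catch(console.error);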
I had the same issue, and I realised that in the deployment group I hadn't tagged the instance IDs for the target group that I was running health checks against to determine whether the target group was healthy. With the tags in place, the deployment group knows which target group it has to deal traffic with.
The issue I ran into was that, for an ELB, if the port was not the expected port, CodeDeploy's BlockTraffic step would not know how to deregister the instance from the target group.
In my example I had my HTTPS ELB communicate via HTTP to port 3000 on each of my target groups. I found the specific root cause by using this guide: https://aws.amazon.com/premiumsupport/knowledge-center/codedeploy-failed-ec2-deployment/
It gave the following output, which identified that I was using port 3000 instead of the expected port 80.
During BlockTraffic, the CodeDeploy service invokes the load balancer to deregister the instance from the target group before it starts installing the application revision.
The DeregisterTargets API call can be seen in the CloudTrail logs during the BlockTraffic lifecycle hook.
Currently CodeDeploy does not support the case where the target group has a different port than the port used to register the instance.
The DeregisterTargets API will not be able to deregister the instance if the port configured in the target group is different.
You need to make sure that both the target group and the instance are configured to use the same port.
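A quick way to check for that mismatch is to compare the target group's configured port with the port each target was registered on, for example with the AWS SDK for JavaScript (the target group ARN is a placeholder):

const AWS = require('aws-sdk');
const elbv2 = new AWS.ELBv2();

const TG_ARN = 'arn:aws:elasticloadbalancing:...:targetgroup/my-tg/...'; // placeholder

async function checkPorts() {
  const { TargetGroups } = await elbv2.describeTargetGroups({ TargetGroupArns: [TG_ARN] }).promise();
  const configuredPort = TargetGroups[0].Port;

  const { TargetHealthDescriptions } = await elbv2.describeTargetHealth({ TargetGroupArn: TG_ARN }).promise();
  for (const t of TargetHealthDescriptions) {
    // A registered port that differs from the target group port is the
    // mismatch that breaks DeregisterTargets during BlockTraffic.
    console.log(`target ${t.Target.Id}: registered on port ${t.Target.Port}, target group port is ${configuredPort}`);
  }
}

checkPorts().catch(console.error);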
BlockTraffic depends mainly on the deregistration delay on the target group, or connection draining on a Classic Load Balancer. To speed up this step, reduce the deregistration delay / connection draining value to something reasonable.