I have an ECS service deployed on Fargate.
It is attached to a Network Load Balancer. Rolling updates were working fine, but suddenly I see the issue below.
When I update the service with a new task definition, Fargate starts the deployment and tries to start a new container. Since the service is attached to the NLB, the new task registers itself with the NLB target group.
But the target group's health check fails, so Fargate kills the failed task and starts a new one. This repeats multiple times (the number varies; today it took 7 hours for the rolling update to finish).
There are no changes to the infrastructure besides the deployment. The security group allows traffic within the VPC, and the NLB and the ECS service are deployed in the same VPC and same subnet.
The health check fails for the task with the same Docker image N number of times, but after that it starts passing.
The target group's healthy/unhealthy threshold is 3, the protocol is TCP, the port is traffic-port, and the interval is 30 seconds. In the microservice startup log I see this:
Started myapp in 44.174 seconds (JVM running for 45.734)
When the task comes up, I tried opening a security group rule for the VPN and accessing the task IP directly. I can reach the microservice directly via the task IP.
So why is the NLB health check failing?
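For reference, here is roughly what the target group configuration described above would look like in CloudFormation terms (just a sketch; the logical names and the traffic port are assumed, not taken from my setup):

NlbTargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    TargetType: ip                      # required for Fargate tasks
    VpcId: !Ref MyVpc                   # hypothetical reference
    Protocol: TCP
    Port: 8080                          # assumed traffic port of the task
    HealthCheckProtocol: TCP
    HealthCheckPort: traffic-port
    HealthCheckIntervalSeconds: 30
    HealthyThresholdCount: 3
    UnhealthyThresholdCount: 3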
I had the exact same issue.
I simulated it with different images (Go, Python) because I suspected CPU/memory utilization overhead, which turned out to be false.
A mitigation is changing the service's deployment parameter "Minimum healthy percent" to 50% (it was 100% before, which seemed to cause the issue).
After the change, the failures became rare, but they still occurred.
The real root cause is still unknown; it seems to be something related to the NLB configuration with Fargate.
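For what it's worth, a minimal sketch of that deployment setting in CloudFormation terms (only the relevant properties are shown; the rest of the service definition is omitted):

Service:
  Type: AWS::ECS::Service
  Properties:
    # ... cluster, task definition and load balancer settings omitted ...
    DeploymentConfiguration:
      MinimumHealthyPercent: 50   # was 100 before the change
      MaximumPercent: 200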
I am currently trying to deploy 2 ECS services on a single EC2 instance for a test environment.
Here is what I have done so far:
Successfully created 2 security groups, one for the load balancer and one for the EC2 instance.
My EC2 Security Group
My ALB Security Group
Successfully created 2 different task definitions for my 2 applications, both Spring Boot. The first application runs on port 8080, and its container port in the task definition is also 8080. The second application runs on port 8081, with container port 8081 in its task definition.
Successfully created an ECS cluster with an Auto Scaling group as the capacity provider. The cluster also recognizes the container instance created by the Auto Scaling group (I am using t2.micro since it is in the free tier). Attached the created security group to the EC2 instance.
My EC2 Security Group
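The screenshots are not reproduced here; as a sketch, the instance security group would typically allow both container ports from the ALB's security group, roughly like this (logical names are hypothetical):

InstanceSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: Allow the ALB to reach both container ports
    VpcId: !Ref MyVpc
    SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: 8080
        ToPort: 8081
        SourceSecurityGroupId: !Ref AlbSecurityGroup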
Successfully created an ALB with 2 forward listeners, 8080 and 8081, each configured to forward to a different target group for each service. Attached the created security group to the ALB.
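A sketch of the two listeners in CloudFormation terms (assuming HTTP and hypothetical logical names):

ListenerService1:
  Type: AWS::ElasticLoadBalancingV2::Listener
  Properties:
    LoadBalancerArn: !Ref MyAlb
    Port: 8080
    Protocol: HTTP
    DefaultActions:
      - Type: forward
        TargetGroupArn: !Ref TargetGroupService1

ListenerService2:
  Type: AWS::ElasticLoadBalancingV2::Listener
  Properties:
    LoadBalancerArn: !Ref MyAlb
    Port: 8081
    Protocol: HTTP
    DefaultActions:
      - Type: forward
        TargetGroupArn: !Ref TargetGroupService2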
Here is how the ECS behaves with my services:
I attempted to create 2 new services. The first service is mapped to port 8080 on the ALB, the second to port 8081. Each of them has its own target group, but the health check configurations are the same.
Health Check Configuration for Service 1
Health Check Configuration for Service 2
The first service deployed pretty smoothly; the health check returned success on the first try.
However, for the second service I used the exact same configuration as for the first one, just a different listener port on the ALB and the application container running on a different port number as well (which I believe should not be a problem). The service made 10 attempts before failing the deployment, and I kept getting this repeated error message: service <service_name> instance <instance_id> port <port_number> is unhealthy in target-group <target_group_name> due to (reason Health checks failed).
This did not happen with my first service with the same configuration. The weird thing is that when I send a request to the ALB domain name on port 8081, the application in the second service seems to work fine without any error. It is just that the failing health check keeps throwing my service off.
I went over a bunch of posts and nothing really helped with the current situation. Also, it is frustrating that I cannot dig up any further details beyond the info in the image below.
Does anyone have a suggestion to resolve this problem? I would really appreciate it.
I am using AWS Elastic Beanstalk. There, I selected a traffic-splitting deployment strategy with a 100% split (so that 100% of new instances will have the new version and have their health evaluated).
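For context, the same deployment policy can also be set through an .ebextensions config rather than the console; a sketch (the evaluation time value here is arbitrary, not my actual setting):

option_settings:
  aws:elasticbeanstalk:command:
    DeploymentPolicy: TrafficSplitting
  aws:elasticbeanstalk:trafficsplitting:
    NewVersionPercent: "100"
    EvaluationTime: "10"    # minutes to evaluate the new instances' health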
Here's how (according to their documentation) that is supposed to work:
During a traffic-splitting deployment, Elastic Beanstalk creates a new set of instances in a separate temporary Auto Scaling group. Elastic Beanstalk then instructs the load balancer to direct a certain percentage of your environment's incoming traffic to the new instances. Then, for a configured amount of time, Elastic Beanstalk tracks the health of the new set of instances. If all is well, Elastic Beanstalk shifts remaining traffic to the new instances and attaches them to the environment's original Auto Scaling group, replacing the old instances. Then Elastic Beanstalk cleans up—terminates the old instances and removes the temporary Auto Scaling group.
And more specifically:
Rolling back the deployment to the previous application version is quick and doesn't impact service to client traffic. If the new instances don't pass health checks, or if you choose to abort the deployment, Elastic Beanstalk moves traffic back to the old instances and terminates the new ones.
However, it seems silly that it only looks at my internal /health health checks, and not at the overall health status of the environment from the HTTP status codes it already has information on.
I tried the following scenario:
Deploy a new version.
As soon as the "health evaluation period" begins, flood the server with error 500s (from an endpoint I made specifically for this purpose).
AWS then moves all my instances into a "degraded", "unhealthy" state, but then seems to ignore it and proceeds anyway.
See the following two log dump screenshots (they are oldest-first).
Is there any way that I can make AWS respect the HTTP status based health checks that it already performs, during a traffic split? Or am I bound to only rely on custom-developed health checks entirely?
Update 1: Even weirder, I tried making my own health checks always fail too, but it still decides to deploy the new version despite the failed health check!
Update 2: I noticed that the temporary Auto Scaling group it creates while assessing health only has an "EC2" type health check, not "ELB". I think that might be the root cause. If only I could get it to use "ELB" instead.
That is interesting! I do not know whether setting the health check type to "ELB" will do the job, because we use CodeDeploy, which has far better rollback capabilities than AWS Elastic Beanstalk.
However, there is a well-documented way in the docs [1] to apply the setting you are looking for:
[...] By default, the Auto Scaling group created for your environment uses Amazon EC2 status checks. If an instance in your environment fails an Amazon EC2 status check, Auto Scaling takes it down and replaces it.
Amazon EC2 status checks only cover an instance's health, not the health of your application, server, or any Docker containers running on the instance. If your application crashes, but the instance that it runs on is still healthy, it may be kicked out of the load balancer, but Auto Scaling won't replace it automatically. [...]
If you want Auto Scaling to replace instances whose application has stopped responding, you can use a configuration file to configure the Auto Scaling group to use Elastic Load Balancing health checks. The following example sets the group to use the load balancer's health checks, in addition to the Amazon EC2 status check, to determine an instance's health.
Example .ebextensions/autoscaling.config
Resources:
  AWSEBAutoScalingGroup:
    Type: "AWS::AutoScaling::AutoScalingGroup"
    Properties:
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
It does not mention the new traffic splitting deployment feature, though.
Thus, I cannot confirm this is the actual solution, but at least you can give it a shot.
[1] https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/environmentconfig-autoscaling-healthchecktype.html
Once upon a time I thought that the Immutable Deployment option in Elastic Beanstalk was a panacea, but it only works when a deployment involves no changes to the application's database schema.
We've now resorted to blue-green deployments. However, this only works if you control the DNS. If you are a SaaS solution and you allow customers to create a CNAME, then B/G is often a spectacular failure because the enterprise: a) sets a very high TTL, and/or b) its internal DNS or firewalls cache the underlying IP addresses of the ALB (which are dynamic and, of course, replaced when you swap the URLs of the blue and green environments).
Traffic splitting is documented as an option in the Elastic Beanstalk documentation.
But it's not actually an option in the configuration section of the console.
This wouldn't be the first time I've seen Elastic Beanstalk's docs be out of date, so it could be that AWS has removed the feature.
Since AWS introduced CodeStar, I suspect Elastic Beanstalk has been getting the cold shoulder.
I've been using AWS CodeDeploy to push our applications live, but it always takes ages in the BlockTraffic and AllowTraffic steps. Currently, I have an Application Load Balancer (ALB) with three EC2 nodes initially (behind an Auto Scaling group). So if I do a CodeDeploy OneAtATime deployment, the whole process takes up to 25 minutes.
The load balancer I'm using had connection draining set to 300s. I thought that was the reason for the drag, but I disabled connection draining and got the same results. I then enabled connection draining with a 5-second timeout and still got the same results.
Further, I found out that CodeDeploy depends on the ALB health check settings. According to the AWS documentation:
After an instance is bound to the ALB, CodeDeploy waits for the status of the instance to be healthy ("inService") behind the load balancer. This health check is done by ALB and depends on the health check configuration.
So I tried setting low timeouts and thresholds in the health check settings. Even those changes didn't reduce the deployment time much.
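For reference, the tightest ALB health check settings look roughly like this (a sketch, not the exact values I used):

TargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    # ... port, protocol and VPC settings omitted ...
    HealthCheckIntervalSeconds: 5    # ALB minimum is 5 seconds
    HealthCheckTimeoutSeconds: 2     # must be shorter than the interval
    HealthyThresholdCount: 2
    UnhealthyThresholdCount: 2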
Can someone direct me to a proper solution to speed up the process?
The issue is the deregistration of instances from the target group. You want to change that setting in the console, or find a way to update the deregistration_delay.timeout_seconds target group attribute; by default it's 300 seconds, which is 5 minutes. The docs can be found here.
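A sketch of that attribute on the target group (the 30-second value is just an example):

TargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    # ... existing health check, port and VPC settings ...
    TargetGroupAttributes:
      - Key: deregistration_delay.timeout_seconds
        Value: "30"    # default is 300 seconds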
I have set up a Docker container running a small Django application on port 5566. The Docker image is uploaded to ECR and later used by Fargate container(s).
I have set up an ECS cluster with a VPC.
After creating the task definition and the service, the service starts up 2 tasks (as it is supposed to):
Here's the service's network access configuration (with a health check grace period of 300s):
I also set up an Application Load Balancer (with DNS) with a target group for the service, but the health checks seem to be failing:
Here's the health check configuration:
Because the health checks are failing, the tasks are terminated and new ones are started roughly every 5 minutes.
Here's the container's port mapping:
As one cannot access the Fargate container (via SSH for example) and the logs are empty, how should I troubleshoot the issue?
I have tried to follow every step in the Troubleshoot Your Application Load Balancer guide.
Feel free to ask additional information.
Can you confirm that your application is working on port 5566 inside Docker?
You can check the logs in CloudWatch. You'll find the link under cluster -> service -> tasks -> your task.
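If the task definition has no log configuration, nothing will show up in CloudWatch; a sketch of an awslogs setup (all names and placeholders here are hypothetical):

TaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    # ... Fargate CPU/memory and awsvpc network mode settings omitted ...
    ContainerDefinitions:
      - Name: django-app
        Image: <account>.dkr.ecr.<region>.amazonaws.com/django-app:latest
        PortMappings:
          - ContainerPort: 5566
            Protocol: tcp
        LogConfiguration:
          LogDriver: awslogs
          Options:
            awslogs-group: /ecs/django-app
            awslogs-region: <region>
            awslogs-stream-prefix: ecs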
Can you post your ALB configuration and your target group port?
I'm running through this tutorial to create a deployment pipeline with my custom .NET-based Docker image.
But when I start a deployment, it gets stuck on the Install phase, so I have to stop it manually:
After that I get a couple of running tasks with different task definitions (note :1 and :4, because I've tried to run the deployment 4 times by now):
They also keep changing state between RUNNING -> PROVISIONING -> PENDING, and the list of stopped tasks grows:
Q:
So, how do I hunt down the issue with CodeDeploy? Why does it run forever?
UPDATE:
It is connected to health checks.
UPDATE:
I'm getting this:
(service dataapi-dev-service, taskSet ecs-svc/9223370487815385540) (port 80) is unhealthy in target-group dataapi-dev-tg1 due to (reason Health checks failed with these codes: [404]).
I don't quite understand why it's failing for the newly created container, because the original one passes the health check.
While the ECS task is running, the ELB (Elastic Load Balancer) constantly health-checks the container as configured in the target group, to verify that the container is still responding.
From your debug message, the container (api) responded to the health check path with a 404.
I suggest you configure the health check path in the target group dataapi-dev-tg1.
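As a sketch, the relevant target group settings look roughly like this (the /health path is just an example; it has to be a path your container answers with a 200):

DataapiDevTg1:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    # ... port, protocol and VPC settings omitted ...
    HealthCheckProtocol: HTTP
    HealthCheckPath: /health    # example path; must return 200, not 404
    Matcher:
      HttpCode: "200"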
For those who are still hitting this issue: in my case the ECS cluster had no outbound connectivity.
Possible solutions to this problem:
Make sure the security groups you use in your VPC allow outbound traffic.
Make sure that the route table you use with the VPC has subnet associations with the subnets you use with your load balancer (examine the route tables; see the sketch below).
I was able to figure it out because I enabled CloudWatch during ECS cluster creation and got a CannotPullContainerError. For more information on solving this problem, look into Cannot Pull Container Image Error.
Make sure your internet gateway is reachable from your subnets through the route table (routes) if your load balancer is internet-facing.
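A sketch of the pieces mentioned above in CloudFormation terms (open egress on the tasks' security group, plus a default route to an internet gateway for the public subnets; all logical names are hypothetical):

TaskSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: ECS tasks
    VpcId: !Ref MyVpc
    SecurityGroupEgress:
      - IpProtocol: "-1"        # allow all outbound traffic
        CidrIp: 0.0.0.0/0

PublicRoute:
  Type: AWS::EC2::Route
  Properties:
    RouteTableId: !Ref PublicRouteTable
    DestinationCidrBlock: 0.0.0.0/0
    GatewayId: !Ref InternetGateway

PublicSubnetRouteTableAssociation:
  Type: AWS::EC2::SubnetRouteTableAssociation
  Properties:
    RouteTableId: !Ref PublicRouteTable
    SubnetId: !Ref PublicSubnetA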
The error is due to the health check, which detected an unhealthy target.
Make sure to check your configuration in the target group settings.