AWS ECS Fargate Target Group Failing Health Checks

The Spring Boot application is running as an ECS Task in an ECS Service of an AWS Fargate cluster. The ECS Service is load balanced, so the Tasks spawned by the Service are automatically registered to a target group.
I am able to call the health endpoint via API Gateway => VPC Link => Network ELB => Application ELB => ECS Task, as shown below:
However, the health checks seem to be failing, and as a result the tasks are being deregistered continuously, resulting in a totally unusable setup.
I have made sure to configure the health check of the Target Group to point towards the right endpoint URL, as shown below:
I also made sure that the Security Group that the Fargate Tasks belong to allows traffic from the Application Load Balancer, as shown below:
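In boto3 terms, that ingress rule amounts to roughly the following; this is only a sketch, and the security group IDs and port are placeholders for the real ones:

```python
import boto3

ec2 = boto3.client("ec2")

# Allow the ALB's security group to reach the tasks' container port.
# "sg-task-0123" / "sg-alb-4567" and port 8080 are placeholders.
ec2.authorize_security_group_ingress(
    GroupId="sg-task-0123",  # security group attached to the Fargate tasks
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8080,
        "ToPort": 8080,
        "UserIdGroupPairs": [{"GroupId": "sg-alb-4567"}],  # the ALB's security group
    }],
)
```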
But somehow, the health checks keep failing and the tasks keep being deregistered, and I'm very confused!
Your help is much appreciated!

The problem actually is with the health check interval (30 seconds) and unhealthy threshold (2 checks), which is too aggressive while the Task is still starting up and not yet able to respond to HTTP requests.
So, I increased the interval and the threshold, and everything is fine now!
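For the record, the same change can also be made outside the console. A minimal boto3 sketch, where the target group ARN, cluster/service names, and the exact values are just examples; the healthCheckGracePeriodSeconds call is an additional, optional knob for slow-starting tasks:

```python
import boto3

elbv2 = boto3.client("elbv2")
ecs = boto3.client("ecs")

TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/my-tg/abc123"  # placeholder

# Relax the target group health check so a slow-starting app isn't deregistered
elbv2.modify_target_group(
    TargetGroupArn=TG_ARN,
    HealthCheckIntervalSeconds=60,  # was 30
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=5,      # was 2
)

# Optionally, give newly started tasks a grace period before LB health checks count
ecs.update_service(
    cluster="my-cluster",           # placeholder names
    service="my-service",
    healthCheckGracePeriodSeconds=120,
)
```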

Related

ECS ELB Health Checks

My main issue is trying to work out why my health checks are failing on ECS.
My setup
I have successfully set up an ECS cluster using an EC2 auto-scaling group. All the EC2 instances are in private subnets with NAT gateways.
I have a load balancer connected up to the target group, which is linked to ECS.
When I try to get an HTTP response from the load balancer from my local machine, it times out, so I am obviously not getting responses back from the containers.
I have been able to ssh into the EC2 instances and confirmed the following:
ECS is deploying containers onto the EC2 instances, then after some time killing them and then firing them up again
I can curl the healthcheck endpoint from the EC2 instance (localhost) and it runs successfully
I can reach the internet from the EC2 instance, eg curl google.com returns an html response
My issue is that there seem to be two different types of health check going on, and I can't figure out which is which.
ELB health-checks
The ELB seems, as far as I can tell, to use the health-checks defined in the target group.
The target group is defined as a list of EC2 instances. So does that mean the ELB is sending requests to the instances to see if they are running?
This would of course fail because we cannot guarantee that ECS will have deployed a container to each instance.
ECS health-checks
ECS however is responsible for deploying containers into these instances, in what could turn out to be a many-to-many relationship.
So surely ECS would be querying the actual running containers to find out if they are healthy and then killing them if required.
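To compare the two views, something like the following boto3 sketch dumps what the target group reports versus what ECS reports for the service's tasks (cluster, service, and target group ARN are placeholders):

```python
import boto3

elbv2 = boto3.client("elbv2")
ecs = boto3.client("ecs")

# The load balancer's view: health state of every registered target (placeholder ARN)
tg_health = elbv2.describe_target_health(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/my-tg/abc123"
)
for desc in tg_health["TargetHealthDescriptions"]:
    print(desc["Target"], desc["TargetHealth"]["State"], desc["TargetHealth"].get("Reason"))

# ECS's view: status of the tasks behind the service (placeholder names)
task_arns = ecs.list_tasks(cluster="my-cluster", serviceName="my-service")["taskArns"]
if task_arns:
    for task in ecs.describe_tasks(cluster="my-cluster", tasks=task_arns)["tasks"]:
        print(task["taskArn"], task["lastStatus"], task.get("healthStatus"))
```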
My confusion / question
I don't really understand what role the ELB has in managing the EC2 instances in this context.
It doesn't seem like the EC2 instances are being stopped and started. However, from reading the docs, it seems that the ASG / ELB will manage the EC2 instances and restart them if they fail the health check.
Does ECS somehow override this default behaviour and take responsibility for running the health checks instead of the ELB?
And if not, won't the health check just fail on any EC2 instance that happens not to have a container running on it?

AWS Fargate: How to deploy a Fargate service task with a network load balancer

Background
Current state: I currently have an NLB that routes to an nginx server running on an EC2 instance.
Goal
I am trying to replace the nginx EC2 instance with a Fargate service that runs nginx.
I would like to keep the current NLB and point its target group at the Fargate service.
Problem
According to the AWS documentation, an ECS Fargate service supports load balancing with an NLB or ALB: https://docs.aws.amazon.com/AmazonECS/latest/userguide/service-load-balancing.html
When I try to deploy the nginx task, in the load balancing section there is only an option to select an existing ALB or create a new ALB.
I tried changing the task protocol to TCP and UDP; regardless of the protocol, when I try to deploy the task as a service, the only load balancer option is still Application Load Balancer.
Question
How do I load balance to a Fargate service task using an NLB? Am I missing a specific setting somewhere?
If you cannot target a Fargate service from an NLB directly, would it be reasonable to route traffic from the NLB to an ALB and then point the ALB's target group at the Fargate service?
You can absolutely use an NLB with an ECS Fargate service. I've done this before many times. My guess is you are simply encountering a bug in the AWS web UI. I've always used Terraform to deploy this sort of thing. I just checked in the ECS web UI, and on the 2nd step of creating a new ECS service I get the option of using a Network Load Balancer:
If your view doesn't look like that, try switching away from the "New ECS Experience" in the UI, which is still fairly beta and missing a lot of features.
I just went back and checked, and the new ECS UI is currently missing the option to select an NLB, so you have to continue using the old version of the UI until they fix that. I suggest sticking with the old UI until they phase it out, because the new one is still missing a lot of features.
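If the console keeps getting in the way, the same thing can be done through the API regardless of UI version. A rough boto3 sketch (rather than the Terraform I normally use), with placeholder names, ARNs, subnets, and security groups; the existing NLB listener still has to be pointed at the new target group:

```python
import boto3

elbv2 = boto3.client("elbv2")
ecs = boto3.client("ecs")

# TCP target group with target type "ip", which is what Fargate tasks require
tg = elbv2.create_target_group(
    Name="nginx-fargate-tg",        # placeholder name
    Protocol="TCP",
    Port=80,
    VpcId="vpc-0123456789abcdef0",  # placeholder VPC
    TargetType="ip",
)
tg_arn = tg["TargetGroups"][0]["TargetGroupArn"]

# Fargate service attached to the NLB via that target group
ecs.create_service(
    cluster="my-cluster",           # placeholder names throughout
    serviceName="nginx",
    taskDefinition="nginx:1",
    desiredCount=2,
    launchType="FARGATE",
    loadBalancers=[{
        "targetGroupArn": tg_arn,
        "containerName": "nginx",
        "containerPort": 80,
    }],
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaa", "subnet-bbb"],
            "securityGroups": ["sg-0123"],
            "assignPublicIp": "DISABLED",
        }
    },
)
```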

Is there a way to configure health checks for an ECS service without a load balancer?

I have an ECS Cluster with 2 ECS Services (1 app-controller, 1 app-event-processor). Is there a way to get health checks on both while API traffic only goes to app-controller? I realize health checks normally come from the load balancer, but if I configure the load balancer to hit app-event-processor, then API traffic also starts flowing to app-event-processor, which is undesirable since I want it to handle only messages from SQS, for example.
As #jordanm mentioned in their comment, ECS does provide a built-in health check mechanism that is orthogonal (and in addition) to the "outside" load balancer health check.
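A container-level health check lives in the task definition and needs no load balancer at all. A minimal sketch, assuming the image contains curl and exposes a /health endpoint on port 8080 (family, image, role, and values are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

# Task definition whose container declares its own health check; ECS runs the
# command inside the container, so no load balancer is involved.
ecs.register_task_definition(
    family="app-event-processor",   # placeholder family name
    networkMode="awsvpc",
    requiresCompatibilities=["FARGATE"],
    cpu="256",
    memory="512",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[{
        "name": "app-event-processor",
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/app-event-processor:latest",  # placeholder
        "essential": True,
        "healthCheck": {
            "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
            "interval": 30,
            "timeout": 5,
            "retries": 3,
            "startPeriod": 60,
        },
    }],
)
```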

How to configure health checks for containers deployed to AWS ECS

I am currently working with AWS ECS and I'm a little confused about how you should configure the health check for containers deployed to AWS ECS.
You can define the health check on the TargetGroup, but you can also define a health check on the TaskDefinition.
I wanted to know what is best practice and why. Currently I have defined it in the TargetGroup and it works as expected.
But I wanted clarity on why you would use one over the other, and whether you would ever define it in both places.
I am using an Application Load Balancer with ECS.
You should use the health check on the ALB if you are using an ALB.
If the ALB check fails, the ALB will mark the target unhealthy and, as a result, your container will be killed.
The most important part of the health check is the HTTP status code: it should be 200, or 3xx/4xx depending on the configuration (the matcher). If the returned code does not match, the target will be marked unhealthy.
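Concretely, the probed path and the accepted status codes (the matcher) are part of the target group's health check settings. A small boto3 sketch with placeholder ARN, path, and codes:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Tell the ALB target group which path to probe and which HTTP status codes
# count as healthy (the "matcher"). ARN, path and codes are placeholders.
elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/my-tg/abc123",
    HealthCheckProtocol="HTTP",
    HealthCheckPath="/health",
    Matcher={"HttpCode": "200-299"},  # e.g. "200", "200,301", or a range
)
```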
Both checks have a different purpose:
If you are using an ALB, you should use the ALB health check.
If you are using scheduler-based Tasks (without a load balancer), then you can use Docker container health checks.
Amazon Elastic Container Service (ECS) now supports Docker container health checks. This gives you more control over monitoring the health of your tasks and improves the ability of the ECS service scheduler to ensure your services are healthy.
Previously, the ECS service scheduler relied on the Elastic Load Balancer (ELB) to report container health status and to restart unhealthy containers. This required you to configure your ECS Service to use a load balancer, and only supported HTTP and TCP health checks.
ecs-supports-container-health-checks-and-task-health-mana
If a service's task fails the load balancer health check criteria, the task is stopped and restarted. This process continues until your service reaches the number of desired running tasks.
service-load-balancing-health

AWS ELB zero downtime deploy

With an ELB setup, there is a health check timeout, e.g. take a server out of the LB if it fails X health checks.
For a real zero-downtime deployment, I actually want to be able to avoid these extra 4-5 seconds of downtime.
Is there a simple way to do that on the ops side, or does this needs to be in the level of the web server itself?
If you're doing continuous deployment, you should deregister the instance you're deploying to from the ELB (say, aws elb deregister-instances-from-load-balancer), wait for the current connections to drain, deploy your app, and then register the instance with the ELB again.
http://docs.aws.amazon.com/cli/latest/reference/elb/deregister-instances-from-load-balancer.html
http://docs.aws.amazon.com/cli/latest/reference/elb/register-instances-with-load-balancer.html
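The same dance can be scripted against the API. A boto3 sketch with a placeholder load balancer name and instance ID, using the built-in waiters to wait for the state changes:

```python
import boto3

elb = boto3.client("elb")  # classic ELB API

LB_NAME = "my-classic-elb"                        # placeholder
INSTANCE = {"InstanceId": "i-0123456789abcdef0"}  # placeholder

# Take the instance out of rotation and wait for it to be deregistered
# (connection draining happens during this window if enabled on the ELB)
elb.deregister_instances_from_load_balancer(LoadBalancerName=LB_NAME, Instances=[INSTANCE])
elb.get_waiter("instance_deregistered").wait(LoadBalancerName=LB_NAME, Instances=[INSTANCE])

# ... deploy the new version of the app here ...

# Put the instance back and wait until the ELB reports it InService again
elb.register_instances_with_load_balancer(LoadBalancerName=LB_NAME, Instances=[INSTANCE])
elb.get_waiter("instance_in_service").wait(LoadBalancerName=LB_NAME, Instances=[INSTANCE])
```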
It is also a common strategy to deploy to another Auto Scaling Group, then just switch the ASG on the load balancer.