Amazon ECS supports two different types of health checks:
Target Group health checks make a configurable network request
Container health checks run inside the Docker container and can be configured to run any shell command that the container supports
If both health checks are configured, which one wins? Is the Service marked as UNHEALTHY if either fails, or only if both fail? Can I configure one to override the other?
I'd very much like the Target Group health status not to cause ECS to continually bounce the service, and I was hoping the container health check could be used to override it.
The AWS documentation is somewhat vague on this topic, but it does suggest a high degree of coupling between the ALB and ECS when it comes to health checks. For example, see the documentation for healthCheckGracePeriodSeconds and minimumHealthyPercent for ECS health check behaviour that is influenced by the presence or absence of a load balancer.
The healthCheckGracePeriodSeconds setting may be useful for preventing a failed ALB health check from causing the ECS container to be restarted (during service startup, at least):
The period of time, in seconds, that the Amazon ECS service scheduler should ignore unhealthy Elastic Load Balancing target health checks, container health checks, and Route 53 health checks after a task enters a RUNNING state. This is only valid if your service is configured to use a load balancer. If your service has a load balancer defined and you do not specify a health check grace period value, the default value of 0 is used.
If your service's tasks take a while to start and respond to health checks, you can specify a health check grace period of up to 2,147,483,647 seconds during which the ECS service scheduler ignores the health check status. This grace period can prevent the ECS service scheduler from marking tasks as unhealthy and stopping them before they have time to come up.
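For illustration, here is a minimal sketch of setting that grace period when creating a service with boto3 (all names and ARNs below are placeholders, not real resources):

```python
import boto3

ecs = boto3.client("ecs")

# Placeholder cluster, service, task definition, and target group ARN.
ecs.create_service(
    cluster="my-cluster",
    serviceName="my-service",
    taskDefinition="my-task:1",
    desiredCount=2,
    loadBalancers=[{
        "targetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef",
        "containerName": "web",
        "containerPort": 80,
    }],
    # The scheduler ignores unhealthy ELB/container/Route 53 health check
    # results for 5 minutes after each task reaches RUNNING.
    healthCheckGracePeriodSeconds=300,
)
```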
In my experience, either one will cause your container to be decommissioned. I would say you probably don't need the container health check if you have a target group performing the check.
Related
We have a private EC2 Linux instance running behind an ALB. There is only one instance running and no auto-scaling configured.
Sometimes the ALB marks the instance as unhealthy for some reason. This mostly happens when network traffic on the instance is high, which generally lasts one or two hours. The behavior is unpredictable. When we then try to access the web application deployed on the EC2 instance, we get a 502 Bad Gateway. We reboot the EC2 instance, and only then is the issue resolved.
Does an ALB perform a health check on a target group again after it marks it as unhealthy? Suppose an ALB marks a target group with one EC2 instance as unhealthy, and the ALB is configured to perform a health check every 30 seconds. Will it check the same target again 30 seconds after marking it unhealthy? Or will it only look for a new healthy instance?
I assume an Auto Scaling configuration might resolve this problem by keeping the group at one instance and replacing it when it goes unhealthy? Our AWS architect suspects Tomcat is creating a memory leak when too many requests come in at once. Tomcat does not run in the EC2 instance.
What is the way to troubleshoot this problem? I searched the system logs and configured ALB access logs, but found no clues.
In this link I see that the ALB routes requests to the unhealthy targets when no other healthy target is available:
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html
My question is: will the ALB perform health checks on the target group again after it marks it as unhealthy?
Indeed, even after a target is marked unhealthy, the ALB continues health checking it. You can configure a 'healthy threshold count', which indicates how many consecutive 'healthy' responses must be received before an unhealthy host is marked as healthy again.
According to the docs:
When the health checks exceed HealthyThresholdCount consecutive successes, the load balancer puts the target back in service.
If your health check interval is 60 seconds, and the healthy threshold count is 3, it takes a minimum of 3 minutes before an unhealthy host will be marked healthy again.
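Both of those knobs live on the target group, so if you want to tune the recovery time, a minimal boto3 sketch looks like this (the target group ARN is a placeholder):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN. With these settings an unhealthy target returns to
# service after 3 consecutive passing checks, i.e. after ~90 seconds.
elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef",
    HealthCheckIntervalSeconds=30,
    HealthyThresholdCount=3,
    UnhealthyThresholdCount=2,
)
```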
I'm seeing random failures of NLB health checks when registering ECS Fargate tasks; the health checks pass after a couple of failures. I have a wide-open security group attached to the Fargate tasks. Has anyone seen similar behaviour while registering tasks under NLB target groups?
Your application can take some time before it starts responding to the health checks from the ELB.
When you create an ECS service, there is an option called healthCheckGracePeriodSeconds.
It governs how many seconds the ECS scheduler will ignore health check information from the ELB. This option is only available if you use an ELB.
So I recommend experimenting with it and picking a time frame that suits your application.
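As a sketch, the grace period can also be adjusted on an existing service without recreating it (the cluster and service names are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

# Placeholder names. Gives slow-starting tasks 3 minutes before ELB
# health check results can cause the scheduler to replace them.
ecs.update_service(
    cluster="my-cluster",
    service="my-service",
    healthCheckGracePeriodSeconds=180,
)
```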
I am currently working with AWS ECS and I'm a little confused on how you should configure the health check for containers deployed to AWS ECS.
You can define the health check on the TargetGroup, but you can also define it on the TaskDefinition.
I wanted to know what is best practice and why. Currently I have defined it in the TargetGroup and it works as expected.
But I wanted clarity on why you would use one over the other? And would you ever define it in both places?
I am using an Application Load Balancer with ECS.
You should use the ALB health check if you are using an ALB.
If the ALB check fails, the ALB will mark the target as unhealthy and, as a result, your container will be killed.
The most important part of the health check is the HTTP status code: it should be 200, or a 3xx/4xx code, depending on the configuration. If the returned code does not match the configured codes, the target is marked unhealthy.
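The accepted status codes are controlled by the target group's matcher. A minimal boto3 sketch (the ARN is a placeholder, and '200-399' is just an example range):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN. Any response outside 200-399 on /health now counts
# as a failed health check for this target group.
elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef",
    HealthCheckPath="/health",
    Matcher={"HttpCode": "200-399"},
)
```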
The two checks have different purposes:
If you are using an ALB, you should use the ALB health check.
If you are running scheduler-based tasks without a load balancer, you can use Docker container health checks, declared on the task definition (see the sketch after the quote below).
Amazon Elastic Container Service (ECS) now supports Docker container health checks. This gives you more control over monitoring the health of your tasks and improves the ability of the ECS service scheduler to ensure your services are healthy.
Previously, the ECS service scheduler relied on the Elastic Load Balancer (ELB) to report container health status and to restart unhealthy containers. This required you to configure your ECS Service to use a load balancer, and only supported HTTP and TCP health-checks.
ecs-supports-container-health-checks-and-task-health-mana
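As referenced above, a container health check is declared on the container definition itself. A minimal boto3 sketch of the relevant fragment (the family, image, and curl-based command are assumptions; whatever command you use must exist inside the image):

```python
import boto3

ecs = boto3.client("ecs")

# Placeholder family and image. The healthCheck command runs *inside*
# the container, so curl must be installed in the image.
ecs.register_task_definition(
    family="my-task",
    containerDefinitions=[{
        "name": "web",
        "image": "my-registry/web:latest",
        "memory": 512,
        "healthCheck": {
            "command": ["CMD-SHELL", "curl -f http://localhost/health || exit 1"],
            "interval": 30,    # seconds between checks
            "timeout": 5,      # seconds before a check counts as failed
            "retries": 3,      # consecutive failures before UNHEALTHY
            "startPeriod": 60, # grace period after container start
        },
    }],
)
```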
If a service's task fails the load balancer health check criteria, the task is stopped and restarted. This process continues until your service reaches the number of desired running tasks.
service-load-balancing-health
I'm a little confused about Elastic Load Balancer health check and Amazon EC2 health check.
In Adding Health Checks to Your Auto Scaling Group it says:
If you have attached one or more load balancers to your Auto Scaling group and an instance fails the load balancer health checks, Auto Scaling does not replace the instance by default.
If you enable load balancer health checks and an instance fails the health checks, Auto Scaling considers the instance unhealthy and replaces it.
So if I don't enable ELB health checks, EC2 health checks will be used, and if an instance fails them, Auto Scaling considers it unhealthy and replaces it; if I enable ELB health checks, the same thing happens. So what's the difference between ELB health checks and EC2 health checks?
The EC2 health check watches instance availability from the hypervisor and networking point of view. For example, in the case of a hardware problem, the check will fail. Also, if an instance is misconfigured and doesn't respond to network requests, it will be marked as faulty.
The ELB health check verifies that a specified TCP port on an instance is accepting connections OR that a specified web page returns a 2xx code. Thus ELB health checks are a little bit smarter: they verify that the actual app works instead of just verifying that the instance is up.
That being said, there is a third check type: the custom health check. If your application can't be checked by a simple HTTP request and requires advanced test logic, you can implement a custom check in your code and set the instance health through the API:
Health Checks for Auto Scaling Instances
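A sketch of that custom-check pattern with boto3 (the instance ID is a placeholder, and my_app_is_healthy stands in for your own test logic):

```python
import boto3

autoscaling = boto3.client("autoscaling")

def my_app_is_healthy() -> bool:
    # Placeholder for your application's own advanced test logic.
    return True

# Report the result to the Auto Scaling group; once an instance is
# marked Unhealthy, the group will replace it.
autoscaling.set_instance_health(
    InstanceId="i-0123456789abcdef0",  # placeholder instance ID
    HealthStatus="Healthy" if my_app_is_healthy() else "Unhealthy",
    ShouldRespectGracePeriod=True,
)
```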
The EC2 instances in my AWS Auto Scaling group all terminate after 1-4 hours of running. The exact time varies, but when it happens, the entire group goes down within minutes of each other.
The scaling history description for each is simply:
At 2016-08-26T05:21:04Z an instance was taken out of service in response to a EC2 health check indicating it has been terminated or stopped.
But I haven't added any health checks. And the EC2 status checks all pass for the life of the instance.
How do I determine what this "health check" failure actually means?
Most questions around ASG termination all lead back to the load balancer, but I have no load balancer. This cluster processes batch jobs, and min/max/desired values are controlled by software based on workload backlog elsewhere in the system.
The ASG history does not indicate a scale-in event, AND the instances are also all protected from scale-in explicitly.
I tried setting the health check grace period to 20 hours to see if that at least leaves the instance up so I can inspect it, but they all still terminate.
The instances are running an ECS AMI, and ECS is running a single task, started at bootup, in a container. The logs from that task look normal, and things seem to be running happily until a few minutes before the instance vanishes.
The task is CPU intensive, but the error still occurs even when I just have it sleep for six hours.
Here are a few suggestions:
To see why an instance was terminated, in EC2's Instance list select the terminated instance, and choose Get System Log from the Instance Settings menu, then scroll down to the bottom to look for any obvious issues. The logs are kept for a while after an instance is terminated (see the sketch after this list for fetching them programmatically).
In the ECS cluster, within your active service, check the Events tab for any messages.
In the Target Group section, verify Health checks and Targets (Registered targets and their Status, and Health of the Availability Zones).
To modify health check settings for a target group using the AWS Console, choose Target Groups, and edit Health checks.
In ASG (EC2's Auto Scaling group), check Details (for Termination Policies), Activity History (for termination messages), Instances (for their Health Status), Scheduled Actions and Scaling Policies.
Check CloudWatch for any available logs.
Check CloudTrail for any suspicious events.
Verify that ECS agents are connected: Why is my Amazon ECS agent listed as disconnected?
Check also: Health Checks for Your Target Groups and Amazon ECS Troubleshooting.
For more suggestions, check: terraform-ecs. Registered container instance is showing 0
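As mentioned in the first suggestion, the system log of a terminated instance can also be pulled programmatically. A boto3 sketch (the instance ID is a placeholder; note the output comes back base64-encoded):

```python
import base64
import boto3

ec2 = boto3.client("ec2")

# Placeholder instance ID; the console log is retained for a while
# after the instance is stopped or terminated.
resp = ec2.get_console_output(InstanceId="i-0123456789abcdef0")
if resp.get("Output"):
    print(base64.b64decode(resp["Output"]).decode("utf-8", errors="replace"))
```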
By default, without an ELB, the ASG will only use instance status checks. However, the actual message you are getting ("an instance was taken out of service in response to a EC2 health check indicating it has been terminated or stopped") sounds more like the OS on the instance shut down, or somebody (or some process) initiated a stop or terminate command. Are these spot instances? This is what you will see when spot instances are terminated.