I have a backend server deployed on AWS on a single EC2 instance via Elastic Beanstalk. The server uses IP whitelisting and hence does not respond to ALB health checks, so the targets in all target groups always remain unhealthy.
According to the official AWS docs on health checks,
If a target group contains only unhealthy registered targets, the load balancer nodes route requests across its unhealthy targets.
This is what keeps my application running even though the ALB target groups are always unhealthy.
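The fail-open rule quoted above can be pictured with a small model. This is only a sketch of the documented behaviour, not how AWS implements it; `Target` and `routable_targets` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Target:
    # Hypothetical stand-in for a registered target; not an AWS API type.
    instance_id: str
    healthy: bool


def routable_targets(targets: List[Target]) -> List[Target]:
    """Return the targets a load balancer node would route to under the quoted rule."""
    healthy = [t for t in targets if t.healthy]
    if healthy:
        # At least one healthy target: only healthy targets receive traffic.
        return healthy
    # Only unhealthy targets are registered: fail open and route across all of them.
    return list(targets)
```

In this model an empty input yields an empty result. That would correspond to a target group with no registered targets at all, which is one documented cause of 503 responses from an ALB, so it may be worth checking whether the instance was still registered during the outage.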
This changed last night: I had an outage in which all requests were rejected with 503s, for reasons I have not been able to figure out. I got things working again by increasing the minimum capacity in Elastic Beanstalk, which provisioned another EC2 instance.
During the outage window, CloudWatch shows neither healthy nor unhealthy instances, even though nothing actually changed: a single EC2 instance had been running untouched for the past few months.
In that gap, I can find metrics on TCP connections though:
I don't really understand what happened here. Can someone explain what happened, or how to debug this?
My main issue is trying to work out why my health checks are failing on ECS.
My setup
I have successfully set up an ECS cluster using an EC2 auto-scaling group. All the EC2 are in private subnets with NAT gateways.
I have a load-balancer all connected up to the target group which is linked to ECS.
When I try and get an HTTP response from the load balancer from my local machine, it times out. So I am obviously not getting responses back from the containers.
I have been able to ssh into the EC2 instances and confirmed the following:
ECS is deploying containers onto the EC2 instances, then after some time killing them and then firing them up again
I can curl the healthcheck endpoint from the EC2 instance (localhost) and it runs successfully
I can reach the internet from the EC2 instance, eg curl google.com returns an html response
My question is that there seem to be two different types of health check going on, and I can't figure out which is which.
ELB health-checks
The ELB seems, as far as I can tell, to use the health-checks defined in the target group.
The target group is defined as a list of EC2 instances. So does that mean the ELB is sending requests to the instances to see if they are running?
This would of course fail because we cannot guarantee that ECS will have deployed a container to each instance.
ECS health-checks
ECS however is responsible for deploying containers into these instances, in what could turn out to be a many-to-many relationship.
So surely ECS would be querying the actual running containers to find out if they are healthy and then killing them if required.
My confusion / question
I don't really understand what role the ELB has in managing the EC2 instances in this context.
It doesn't seem like the EC2 instances are being stopped and started. However, the docs seem to indicate that the ASG / ELB will manage the EC2 instances and restart them if they fail the health check.
Does ECS somehow override this default behaviour and take responsibility for running the healthchecks instead of the ELB?
And if not, won't the health check just fail on any EC2 instance that happens not to have a container running on it?
We have a private EC2 Linux instance running behind an ALB. There is only one instance running and no auto-scaling configured.
Sometimes the ALB marks the instance as unhealthy for some reason. This mostly happens when network traffic on the instance is high, which generally lasts one or two hours. This behavior is unpredictable. When we then try to access the web application deployed on the EC2 instance, we get a 502 Bad Gateway. The issue is resolved only after we reboot the EC2 instance.
Does an ALB perform a health check on a target group again after it marks it as unhealthy? Suppose an ALB marks the target group with one EC2 instance as unhealthy, and the ALB is configured to perform a health check every 30 seconds. Will it check the same target group again 30 seconds after marking it unhealthy, or will it look for a new healthy instance?
I assume an Auto Scaling configuration might resolve this problem by setting the Auto Scaling group to 1 when an instance goes unhealthy? Our AWS architect feels that Tomcat is leaking memory when too many requests come in at once. Tomcat does not run in the EC2.
What is the way to troubleshoot this problem? I searched the system logs and configured ALB access logs, but found no clues.
In this link I see that the ALB routes requests to unhealthy targets when no healthy target is available:
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html
My question is: will the ALB perform a health check on the target group again after it marks it as unhealthy?
Yes. Even after a target is marked unhealthy, the ALB continues health checking it. You can configure a 'healthy threshold count', which indicates how many 'healthy' responses should be received before an unhealthy host is marked as healthy again.
According to the docs:
When the health checks exceed HealthyThresholdCount consecutive successes, the load balancer puts the target back in service.
If your health check interval is 60 seconds, and the healthy threshold count is 3, it takes a minimum of 3 minutes before an unhealthy host will be marked healthy again.
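The estimate above is simply the interval multiplied by the threshold. A minimal sketch of that arithmetic follows; `min_recovery_seconds` is a hypothetical helper, not an AWS API, and real timing also depends on when the first successful check happens to land.

```python
def min_recovery_seconds(interval_seconds: int, healthy_threshold: int) -> int:
    """Rough estimate of the time needed to collect enough consecutive successes.

    Follows the rule of thumb from the answer above: one health check per
    interval, and `healthy_threshold` consecutive successes are required.
    """
    return interval_seconds * healthy_threshold


# The example from the answer: 60 second interval, threshold of 3.
print(min_recovery_seconds(60, 3) / 60, "minutes")
```

Lowering either the interval or the healthy threshold shortens this window, at the cost of making the check more sensitive to brief blips.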
Outline:
I have a very simple ECS container which listens on port 5000 and writes out HelloWorld, plus the hostname of the instance it is running on. I want to deploy many of these containers using ECS and load balance them, just to learn more about how this works. It is working to a certain extent, but my health check is failing (timing out), which is causing the container tasks to be bounced up and down.
Current configuration:
1 VPC ( 10.0.0.0/19 )
1 Internet gateway
3 private subnets, one for each AZ in eu-west-1 (10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24)
3 public subnets, one for each AZ in eu-west-1 (10.0.10.0/24, 10.0.11.0/24, 10.0.12.0/24)
3 NAT instances, one in each of the public subnets, routing 0.0.0.0/0 to the Internet gateway and each assigned an Elastic IP
3 ECS instances, again one in each private subnet with a route to the NAT instance in the corresponding public subnet in the same AZ as the ECS instance
1 ALB load balancer (Internet facing) which is registered with my 3 public subnets
1 Target group (with no instances registered as per ECS documentation) but a health check set up on the 'traffic' port at /health
1 Service bringing up 3 tasks spread across AZs and using dynamic ports (which are then mapped to 5000 in the docker container)
Routing
Each private subnet has a route for 10.0.0.0/19, and a default route for 0.0.0.0/0 to the NAT instance in the public subnet in the same AZ.
Each public subnet has the same 10.0.0.0/19 route and a default route for 0.0.0.0/0 to the internet gateway.
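To sanity-check that health-check traffic between the ALB's public subnets and the instances' private subnets stays inside the VPC, the route tables described above can be modelled with a longest-prefix match. This is a sketch using only the CIDRs from this post; the target labels ("local", "nat-instance", "internet-gateway") and the sample addresses are illustrative assumptions, not real resource IDs.

```python
import ipaddress
from typing import List, Tuple

# Route tables as described above; the target names are illustrative labels only.
PRIVATE_ROUTES = [("10.0.0.0/19", "local"), ("0.0.0.0/0", "nat-instance")]
PUBLIC_ROUTES = [("10.0.0.0/19", "local"), ("0.0.0.0/0", "internet-gateway")]


def pick_route(routes: List[Tuple[str, str]], destination: str) -> str:
    """Pick the most specific (longest-prefix) route that contains the destination."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(cidr), target)
        for cidr, target in routes
        if dest in ipaddress.ip_network(cidr)
    ]
    # The default route 0.0.0.0/0 always matches, so `matches` is never empty here.
    best = max(matches, key=lambda m: m[0].prefixlen)
    return best[1]


# An ALB node in a public subnet reaching an instance in a private subnet.
print(pick_route(PUBLIC_ROUTES, "10.0.1.25"))
# An instance in a private subnet replying to the ALB node.
print(pick_route(PRIVATE_ROUTES, "10.0.11.7"))
```

Both directions resolve to the local route in this model, which suggests the routing itself is unlikely to be what makes the health check time out.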
Security groups
My instances are in a group that allows egress to anywhere and ingress on ports 32768 - 65535 from the security group the ALB is in.
The ALB is in a security group that allows ingress on port 80 only, but allows egress on any port/protocol to the security group my ECS instances are in.
What happens
When I bring all this up, it actually works - I can take the public DNS record of the ALB and refresh, and I see responses coming back from my container app telling me the hostname. This is exactly what I want to achieve. However, it fails the health check and the container is drained and replaced - with another one that fails the health check. This continues in a cycle; I have never seen a single successful health check.
What I've tried
Tweaked the health check intervals to make ECS require about 5 minutes of solid failed health-checks before killing the task. I thought this would eliminate it being a bit sensitive when the task starts up? This still goes on to trigger the tear-down, despite me being able to view the application running in my browser throughout.
Confirmed the /health URL endpoint in a number of ways. I can retrieve it publicly via the ALB (as well as view the main app root URL at '/') and curl tells me it has a proper 200 OK response (which the health check is set to look for by default). I have ssh'ed into my ECS instances and performed a curl --head {url} on '/' and '/health' and both give a 200 OK response. I've even spun up another instance in the public subnet, granted it the same access as the ALB security group to my instances and been able to curl the health check from there.
Summary
I can view my application correctly load-balanced across AZs and private subnets on both its main URL '/' and its health check URL '/health' through the load balancer, from the ECS instance itself, and by using the instance's private IP and port from another machine within the public subnet the ALB is in. The ECS service just cannot see this health check once without timing out. What on earth could I be missing??
For anyone who follows: I accidentally broke the app in my container and it started throwing a 500 error. Crucially, the health check began reporting this 500 error, so it was NOT a network timeout. That means that when the health check contacts the endpoint in my app, the response is not being handled properly. This appears to be a problem related to Nancy (the API framework I was using) and Go, which sometimes reports "Client.Timeout exceeded while awaiting headers", and I am sure ECS is interpreting this as a network timeout. I'm going to tcpdump the network traffic to see what the health check sends and what Nancy responds, and compare that with a container that works. Perhaps there is a Nancy fix, or maybe ECS needs to be less fussy.
edit:
Simply updating all the NuGet packages that my Nancy app was using to the latest available versions made everything start working!
More questions than answers, but maybe they will take you in the right direction.
You say that you can access the container app via the ALB, but then the node fails the health check. The ALB should not be allowing connections to the node until its health check succeeds. So if you are connecting to the node via the ALB, then the ALB must have tested it and decided it was healthy. Is it a different health check that is killing the node?
Have you checked CloudTrail to see if it has any clues about what is triggering the tear-down? Is the tear-down being triggered by the ALB or by the Auto Scaling group? Might it be that the Auto Scaling group has the wrong scale-in criteria?
Good luck
I have 2 machines running under an Elastic Beanstalk environment.
One of them is down since the last deployment.
I was hoping that the auto scaling configuration would launch a new machine, since only a single machine was available.
That didn't happen and I'm trying to figure out what's wrong with my auto scaling configuration:
The first thing I see is that your rules contradict each other. They say that if the number of unhealthy hosts is above 0, add a single host, and if it is below 2, remove a single host. That may explain why you aren't seeing anything happen with your trigger.
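To make the contradiction concrete, here is a small model of the two rules as described in this answer (add a host above 0 unhealthy hosts, remove a host below 2). It is a simplified sketch, not how Auto Scaling actually evaluates CloudWatch alarms, and `triggered_actions` is a hypothetical helper.

```python
from typing import List


def triggered_actions(unhealthy_hosts: int, upper: int = 0, lower: int = 2) -> List[str]:
    """List which of the two scaling rules would fire for a given metric value."""
    actions = []
    if unhealthy_hosts > upper:
        # Upper threshold breached: scale out by one instance.
        actions.append("add 1 instance")
    if unhealthy_hosts < lower:
        # Lower threshold breached: scale in by one instance.
        actions.append("remove 1 instance")
    return actions


for count in (0, 1, 2):
    print(count, triggered_actions(count))
```

With exactly one unhealthy host, which is the situation in the question, both rules match at once, so the scale-out and scale-in requests work against each other.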
Scaling triggers are used to bring in, or reduce, EC2 instances in your Auto Scaling group. This would be useful to bring in an additional instance(s) to maintain the same amount of computational power for your application while you investigate what caused the bad instance to fail. But this will not replace the instance.
To set up your instances to terminate after a certain period of being unhealthy you can follow the documentation here.
By default the ELB pings port 80 with TCP; this is what determines the "health" of the EC2 instance, along with the on-host EC2 instance status check. You can specify an Application health check URL to set up a customized health check that your application returns. Check out the more detailed customization of Beanstalk ELBs here.
We are using CodeDeploy to load code onto our instances as they boot up. Our intention was that they would not be added to the LB before the code was loaded. To do this, we set a health check which looks for one of the files being deployed. What we have found is that sometimes instances without code are created (I assume CodeDeploy failed) and these instances stay in the LB even when marked unhealthy. How is this possible? Is this related to the grace period? Shouldn't instances that are unhealthy be removed automatically?
I believe I have found a large part of my problem: my Auto Scaling group was set to use EC2 health checks and not my ELB health check, so the instance was not terminated. Traffic may have continued to flow to this crippled instance for longer because a very long unhealthy state was required before traffic was stopped completely.
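The difference between the two health check types can be sketched as a simple decision function. This is a simplified model under stated assumptions, not the Auto Scaling implementation: `asg_marks_unhealthy` is a hypothetical name, and the grace period is treated as simply suppressing all health-based replacement.

```python
def asg_marks_unhealthy(
    health_check_type: str,
    ec2_status_ok: bool,
    elb_healthy: bool,
    in_grace_period: bool = False,
) -> bool:
    """Decide whether the Auto Scaling group would treat an instance as unhealthy.

    health_check_type is "EC2" or "ELB", mirroring the group's setting.
    """
    if in_grace_period:
        # Assumption: health check results are ignored during the grace period.
        return False
    if not ec2_status_ok:
        # EC2 status checks count under both settings.
        return True
    if health_check_type == "ELB":
        # Only with the ELB type does a failed load balancer check count.
        return not elb_healthy
    return False


# The situation described above: the instance boots fine but has no code deployed.
print(asg_marks_unhealthy("EC2", ec2_status_ok=True, elb_healthy=False))
print(asg_marks_unhealthy("ELB", ec2_status_ok=True, elb_healthy=False))
```

Under the EC2 setting the instance is never replaced in this model, which matches the behaviour described; switching the group to the ELB health check type is what lets the failed file check lead to termination.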