Is there any event that could cause an EC2 instance to be removed from an ELB in AWS?
We have 4 services in a load balancer that usually chug away doing their thing, being updated only very occasionally. We've had issues from time to time where some of the instances would choke and need to be restarted, but they'd still be in the load balancer, simply listed as OutOfService.
However, today, we checked and found only 1 instance listed in the load balancer (as in completely removed, not OutOfService). The other 3 were healthy and the health check URL was returning a 200 status code. There are only two of us with access to the account, so it definitely wasn't done manually.
Is there anything that could've caused the instances to be removed from the load balancer?
In this specific instance, the issue was caused by our deployment process taking machines out of the load balancer when deploying (i.e. doing a blue-green deployment). When a deployment issue occurred, the nodes wouldn't be added back into the load balancer.
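For anyone hitting the same thing: a quick way to detect and undo this is to compare what the deploy pipeline expects against what the ELB actually has registered. Here is a minimal boto3 sketch; the load balancer name and instance IDs are placeholders, not values from the question.

```python
import boto3

# Placeholder load balancer name and instance IDs -- replace with your own.
ELB_NAME = "my-classic-elb"
EXPECTED_INSTANCES = ["i-0123456789abcdef0", "i-0fedcba9876543210"]

elb = boto3.client("elb")

# List what the ELB currently thinks is registered, and in what state.
health = elb.describe_instance_health(LoadBalancerName=ELB_NAME)
registered = {s["InstanceId"]: s["State"] for s in health["InstanceStates"]}
print("Currently registered:", registered)

# Re-register anything the deploy script removed and never added back.
missing = [i for i in EXPECTED_INSTANCES if i not in registered]
if missing:
    elb.register_instances_with_load_balancer(
        LoadBalancerName=ELB_NAME,
        Instances=[{"InstanceId": i} for i in missing],
    )
    print("Re-registered:", missing)
```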
We have an ELB which can scale up and down from 1 to 4 instances. When we deploy a new version of the server it spins up a new instance.
Because we need 256-bit encryption on our HTTPS, we've been forced to use the Classic Load Balancer (the only one where we can enforce this for clients, in our case Android and iPhone apps). What we noticed is that the load balancer is strictly associated with a specific instance, so when the server is redeployed the load balancer stops working because the instance it was associated with no longer exists.
Is there any way to handle this? Or is there a way we can use an Application Load Balancer and still get 256-bit encryption?
You could use rolling updates with Elastic Beanstalk. During an update, a new instance is added first and should be associated with the load balancer. When the new instance is up and running, the old one is stopped and removed from the load balancer. New instances should be associated with the load balancer automatically; if that doesn't happen, something is wrong with your Elastic Beanstalk environment.
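If the environment isn't already set up this way, the deployment policy can be switched through the aws:elasticbeanstalk:command namespace. A rough boto3 sketch, assuming a hypothetical environment named my-env:

```python
import boto3

eb = boto3.client("elasticbeanstalk")

# Deploy with an extra batch of new instances first, so the load balancer
# always has healthy capacity registered during the update.
eb.update_environment(
    EnvironmentName="my-env",  # placeholder environment name
    OptionSettings=[
        {
            "Namespace": "aws:elasticbeanstalk:command",
            "OptionName": "DeploymentPolicy",
            "Value": "RollingWithAdditionalBatch",
        },
    ],
)
```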
You can definitely use 256-bit encryption with a Network Load Balancer. You will have to terminate SSL at the EC2 instances.
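The idea is to forward raw TCP on port 443 straight to the instances, which then do the TLS handshake themselves and can enforce whatever cipher strength they need. A rough boto3 sketch of that setup; the VPC ID, subnet IDs, instance ID, and resource names below are placeholders.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder identifiers -- substitute your own VPC, subnets, and instance.
VPC_ID = "vpc-0123456789abcdef0"
SUBNETS = ["subnet-aaaa1111", "subnet-bbbb2222"]
INSTANCE_ID = "i-0123456789abcdef0"

# Network Load Balancer operating at layer 4.
nlb = elbv2.create_load_balancer(
    Name="tls-passthrough-nlb",
    Type="network",
    Subnets=SUBNETS,
)["LoadBalancers"][0]

# TCP target group: encrypted traffic is passed through untouched, so the
# EC2 instances terminate TLS and control the cipher suite themselves.
tg = elbv2.create_target_group(
    Name="tls-passthrough-tg",
    Protocol="TCP",
    Port=443,
    VpcId=VPC_ID,
    TargetType="instance",
)["TargetGroups"][0]

elbv2.register_targets(
    TargetGroupArn=tg["TargetGroupArn"],
    Targets=[{"Id": INSTANCE_ID}],
)

# TCP listener on 443 forwarding to the target group.
elbv2.create_listener(
    LoadBalancerArn=nlb["LoadBalancerArn"],
    Protocol="TCP",
    Port=443,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg["TargetGroupArn"]}],
)
```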
I have a backend server deployed on AWS in a single EC2 instance via Elastic Beanstalk. The server has IP whitelisting and hence does not respond to ALB health checks, so all target groups always remain unhealthy.
According to the official AWS docs on health checks,
If a target group contains only unhealthy registered targets, the load balancer nodes route requests across its unhealthy targets.
This is what keeps my application running even though the ALB target groups are always unhealthy.
This changed last night and I faced an outage where all requests started getting rejected with 503s, for reasons I haven't been able to figure out. I was able to get things working again by provisioning another EC2 instance through increasing the minimum capacity of the Elastic Beanstalk environment.
During the window of the outage, CloudWatch shows neither healthy nor unhealthy instances, even though nothing actually changed: the same single EC2 instance had been running untouched for the past few months.
In that gap, I can find metrics on TCP connections though:
I don't really understand what happened here, can someone explain what or how to debug this?
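One thing that can help narrow this down is asking the ALB directly for its view of each target at the time, since the HealthyHostCount and UnHealthyHostCount metrics can both read zero while a target is still registered. A minimal boto3 sketch; the target group ARN is a placeholder.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN -- use the target group behind your ALB.
TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "targetgroup/my-tg/0123456789abcdef"
)

# Shows each registered target plus the reason the ALB gives for its state
# (e.g. Target.FailedHealthChecks, Target.Timeout, Target.NotRegistered).
resp = elbv2.describe_target_health(TargetGroupArn=TARGET_GROUP_ARN)
for desc in resp["TargetHealthDescriptions"]:
    health = desc["TargetHealth"]
    print(
        desc["Target"]["Id"],
        health["State"],
        health.get("Reason"),
        health.get("Description"),
    )
```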
I registered an instance with the load balancer, and index.html loads fine via the instance's public DNS address.
However, although the instance appears as InService on the load balancer, it shows N/A, and index.html does not load via the load balancer's DNS address.
After hearing that registration can take some time, I tried again two days later, but it still doesn't work.
Is there something I'm missing?
Based on the comments.
The issue turned out to be incorrectly set security groups.
Other common possible reasons are listed in the AWS documentation:
Troubleshoot a Classic Load Balancer: Instance Registration
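For future readers, the quickest way to see why an instance is registered but not serving is to ask the Classic Load Balancer for its view of the instance, then compare the security groups involved. A minimal boto3 sketch; the load balancer name is a placeholder.

```python
import boto3

ELB_NAME = "my-classic-elb"  # placeholder

elb = boto3.client("elb")
ec2 = boto3.client("ec2")

# The ReasonCode/Description fields usually point at the actual problem
# (failed health checks, registration still in progress, etc.).
states = elb.describe_instance_health(LoadBalancerName=ELB_NAME)["InstanceStates"]
for s in states:
    print(s["InstanceId"], s["State"], s["ReasonCode"], s["Description"])

# Cross-check the security groups: the instance's group must allow traffic
# from the load balancer's group on the listener / health check port.
lb = elb.describe_load_balancers(LoadBalancerNames=[ELB_NAME])["LoadBalancerDescriptions"][0]
print("Load balancer security groups:", lb["SecurityGroups"])

instance_ids = [i["InstanceId"] for i in lb["Instances"]]
if instance_ids:
    for r in ec2.describe_instances(InstanceIds=instance_ids)["Reservations"]:
        for inst in r["Instances"]:
            groups = [g["GroupId"] for g in inst["SecurityGroups"]]
            print(inst["InstanceId"], "security groups:", groups)
```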
The original problem I am trying to solve is when the load balancer starts forwarding requests to a newly-initialized ec2 instance, the first request to that new instance takes ~10 seconds. Subsequent requests are fine (~100 ms for the same request). I have also observed the browser taking a long time to load the web application after I replace the ec2 instance in the load balancer. I believe both problems likely have the same root cause, and since the latter problem is easier to explain and test, I will provide details on that problem.
I have the following infrastructure set up in AWS for this test:
AMI that contains a web application, hosted in IIS
ASG that points to that AMI with desired capacity = 1
Target group with appropriate health check
Application load balancer
Here is the test I run:
Terminate the ec2 instance that is in the ASG
Wait for the ASG to replace the ec2 instance
Wait until the ASG reports that new instance as healthy
Directly load the web application via the ec2 instance IP in Incognito Chrome browser (no load balancer) - loads in <100 ms
Load the web application via the load balancer in Incognito Chrome browser - takes ~20 seconds
I can repeat this test for some time, usually with similar results. Eventually, it seems like something "clicks" and the site starts loading very fast through the load balancer.
What could cause this behavior? Is there something we could change in the load balancer configuration to resolve this issue? As noted above, the web application loads very quickly when accessed directly via the ec2 IP address, so it's not an issue with the application itself.
Yes, you can control when the load balancer marks a target healthy through its health check settings.
You can tune the initial delay using these settings.
HealthCheck Interval
The amount of time between health checks of an individual instance, in seconds.
Valid values: 5 to 300
Default: 30
Healthy Threshold
The number of consecutive successful health checks that must occur before declaring an EC2 instance healthy.
Valid values: 2 to 10
Default: 10
So with the default values, it takes 10 * 30 = 300 seconds for an instance to be marked healthy.
To tune this, set the HealthCheck Interval to 5 and the Healthy Threshold to 2, so the load balancer only needs 5 * 2 = 10 seconds to mark the target healthy and start routing traffic to it.
See elb-healthchecks or target-group-health-checks in the AWS documentation.
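As a concrete example, the same tuning can be applied to an ALB target group with boto3 (the ARN below is a placeholder); the equivalent call for a Classic Load Balancer is configure_health_check.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder target group ARN.
TARGET_GROUP_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "targetgroup/my-tg/0123456789abcdef"
)

# Check every 5 seconds and require 2 consecutive passes, so a new target
# is marked healthy after roughly 5 * 2 = 10 seconds instead of minutes.
elbv2.modify_target_group(
    TargetGroupArn=TARGET_GROUP_ARN,
    HealthCheckIntervalSeconds=5,
    HealthyThresholdCount=2,
)
```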
My application was working fine when my EC2 instance was launched in a public subnet, which was attached to the Application Load Balancer. There were no such latency issues.
When I moved my EC2 instance to a private subnet and then attached that private subnet to the ALB (Application Load Balancer), this issue popped up.
The issue I faced was as follows: when I accessed my EC2 instance for the first time via the Route 53 DNS record that points to the ALB, it took 1-2 minutes to respond (or returned a gateway timeout), and the browser's network tab showed the time being spent on the "initial connection". After the first response, the rest of the site worked normally and responded in under a second.
I solved this problem by creating a new subnet in the same availability zone as my private subnet and then attaching the newly created subnet to the ALB instead of the private subnet.
Example: testEc2 is created in subnet A (a private subnet in us-east-1a), so create a new subnet B (a public subnet in us-east-1a, attached to an internet gateway in its route table).
Then, in Load Balancer -> Edit subnets -> select us-east-1a, choose the public subnet, and save.
I hope this will solve the above issue.
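In code, the same change is a single call. Note that an ALB must be attached to at least two subnets in different Availability Zones, so both subnet IDs below (and the ALB ARN) are placeholders.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholders: the ALB's ARN and the subnets to attach (one per AZ,
# including the newly created public subnet in us-east-1a).
ALB_ARN = (
    "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
    "loadbalancer/app/my-alb/0123456789abcdef"
)
SUBNETS = ["subnet-aaaa1111", "subnet-bbbb2222"]

# Replaces the ALB's current subnet attachments with the ones given here.
elbv2.set_subnets(LoadBalancerArn=ALB_ARN, Subnets=SUBNETS)
```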
We are using CodeDeploy to load code onto our instances as they boot up. Our intention was that they would not be added to the LB before the code was loaded. To do this, we set a health check which looks for one of the deployed files. What we have found is that sometimes instances without code are created (I assume CodeDeploy failed), and these instances stay in the LB even when marked unhealthy. How is this possible? Is this related to the grace period? Shouldn't unhealthy instances be removed automatically?
I believe I have found a large part of my problem: my Auto Scaling group was set to use EC2 health checks rather than my ELB health check. This resulted in the instance not being terminated. Traffic may also have continued flowing to this crippled instance for longer because a very long unhealthy state is required before traffic is completely stopped.
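In case it helps anyone else, switching the Auto Scaling group over to the ELB health check (with a grace period long enough for CodeDeploy to finish) is a single boto3 call; the group name and grace period below are placeholders.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# With HealthCheckType="ELB", instances that fail the load balancer's
# health check are marked unhealthy and replaced by the ASG, instead of
# lingering behind the load balancer indefinitely.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="my-asg",   # placeholder group name
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,      # seconds to allow the deployment to finish
)
```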