Application ELB routes traffic to newly added instance before grace period - amazon-web-services

I have set up an Auto Scaling group with the health check grace period set to 300 seconds (5 minutes). My new instance takes at most 2.5 minutes to boot up and be ready to handle HTTP requests. But I am noticing that each time a new instance is added, the ELB starts forwarding traffic to it well before the grace period (i.e. 5 minutes) is over, and because of this I am getting 502 Bad Gateway errors.
Can anyone explain why my Application Load Balancer is behaving like this?
I am using ELB-type health checks, and below are the settings of my target group health check:
Protocol : HTTP
Port : 80
Healthy threshold : 2
Unhealthy threshold : 10
Timeout : 10 seconds
Interval : 150 seconds
Success codes : 200
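For reference, here is a minimal boto3 sketch of how a target group health check with these settings could be applied; the target group ARN is a placeholder:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN -- substitute your own target group.
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/my-tg/..."

# Apply the health check settings listed above.
elbv2.modify_target_group(
    TargetGroupArn=TARGET_GROUP_ARN,
    HealthCheckProtocol="HTTP",
    HealthCheckPort="80",
    HealthyThresholdCount=2,
    UnhealthyThresholdCount=10,
    HealthCheckTimeoutSeconds=10,
    HealthCheckIntervalSeconds=150,
    Matcher={"HttpCode": "200"},
)
```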

This is normal behavior. The grace period is not there to prevent health checks from happening, and this holds true for both ELB and EC2 health checks. During the grace period that you specify, both the ELB and the EC2 service will still send health checks to your instance. The difference is that Auto Scaling will not act on the results of these checks, which means it will not automatically schedule the instance for replacement.
Only after the instance is up and running correctly (it has passed the ELB and EC2 health checks) will the ELB register the instance and start sending normal traffic to it. This can happen before the grace period expires. If you see 502 errors after the instance has been registered with the ELB, then your problem is somewhere else.
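As a hedged illustration of how the two settings relate on the Auto Scaling side, this boto3 sketch (the group name is a placeholder) sets the health check type and grace period; the grace period only delays replacement decisions, not the checks or the registration:

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Use ELB health checks for replacement decisions, but give new
# instances 300 seconds before failed checks can trigger replacement.
# The ELB still health-checks (and may register) the instance earlier.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="my-asg",   # placeholder name
    HealthCheckType="ELB",
    HealthCheckGracePeriod=300,
)
```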

Finally, I resolved my issue. I am writing my solution here to help anyone else facing the same problem.
My initial feeling was that the Application Load Balancer was routing traffic to the newly added instance before it was ready to serve. But a detailed investigation showed that was not the issue. The new instance was able to serve traffic from the start; after a few minutes it would generate these ELB-level 502 errors for around 30 seconds, and after that it started working normally again.
Solution:
The Application Load Balancer has a default idle (keep-alive) timeout of 60 seconds, while Apache2 has a default KeepAliveTimeout of 5 seconds. When those 5 seconds are up, Apache2 closes the connection it has with the ELB. However, if a request comes in at precisely the wrong moment, the ELB accepts it and decides which host to forward it to, and in that very moment Apache closes the connection. This results in the 502 error code.
I set the ELB idle timeout to 60 seconds and the Apache2 KeepAliveTimeout to 120 seconds, so the backend keeps connections open longer than the load balancer does. This solved my problem.
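A minimal sketch of that fix, assuming boto3 for the ALB side (the load balancer ARN is a placeholder); the Apache side is the usual KeepAliveTimeout directive, shown here only as a comment:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Keep the ALB idle timeout at 60 seconds (the default).
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:...:loadbalancer/app/my-alb/...",
    Attributes=[{"Key": "idle_timeout.timeout_seconds", "Value": "60"}],
)

# On the Apache2 side (e.g. in apache2.conf), make the backend
# keep-alive outlive the ALB idle timeout so Apache never closes
# a connection the ALB still considers open:
#   KeepAlive On
#   KeepAliveTimeout 120
```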

Related

Getting 5xx error with AWS Application Load Balancer - fluctuating healthy and unhealthy target group

My web application on AWS EC2 + load balancer sometimes shows 500 errors. How do I know if the error is on the server side or the application side?
I am using a Route 53 domain with SSL on my URL. I set the ALB to redirect requests on port 80 to 443 and to forward requests on port 443 to the target group (the EC2 instance). However, the target group sometimes returns 5xx error codes when handling requests. Please see the screenshots for the ALB metrics and configuration.
(Screenshots: Target Group Metrics, Target Group Configuration, Load Balancer Metrics, Load Balancer Listeners, EC2 Metrics)
Right now the web application is running unsteadily; sometimes it returns a 502 or a 503 Service Unavailable (it seems like a connection timeout).
I have set the ALB idle timeout to 4000 seconds.
(Screenshot: ALB configuration)
The application is using Nuxt.js + PHP7.0 + MySQL + Apache 2.4.54.
I have set the Apache prefork MaxClients to 1000, which should be enough to handle the requests to the application.
The EC2 instance is a t2.large; the CPU and memory look sufficient to handle the processing.
It seems that if I request the IP address directly instead of the domain, the number of 5xx errors is significantly reduced (but they still occur).
I also have a WordPress application hosted on this EC2 instance on a subdomain (CNAME). I have never encountered any 5xx errors on that subdomain, which makes me guess there might be errors in my application code rather than on the server side.
Is the 5xx error from my application or from the server?
I also tried to add another EC2 instance to the target group to see if there would be at least one healthy instance to handle the requests. However, the application uses a third-party API with a strict IP whitelist policy, and from my research the Elastic IP I got from AWS cannot be attached to two different EC2 instances.
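For context, a listener setup like the one described (HTTP 80 redirected to HTTPS 443, HTTPS 443 forwarded to the target group) might look roughly like this in boto3; all ARNs are placeholders:

```python
import boto3

elbv2 = boto3.client("elbv2")

ALB_ARN = "arn:aws:elasticloadbalancing:...:loadbalancer/app/my-alb/..."   # placeholder
TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/my-tg/..."          # placeholder
CERT_ARN = "arn:aws:acm:...:certificate/..."                               # placeholder

# Redirect HTTP :80 to HTTPS :443.
elbv2.create_listener(
    LoadBalancerArn=ALB_ARN,
    Protocol="HTTP",
    Port=80,
    DefaultActions=[{
        "Type": "redirect",
        "RedirectConfig": {"Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"},
    }],
)

# Forward HTTPS :443 to the target group.
elbv2.create_listener(
    LoadBalancerArn=ALB_ARN,
    Protocol="HTTPS",
    Port=443,
    Certificates=[{"CertificateArn": CERT_ARN}],
    DefaultActions=[{"Type": "forward", "TargetGroupArn": TG_ARN}],
)
```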
First of all, if your application is prone to stutters, increase the health check retries and timeouts; this addresses the flapping between healthy and unhealthy in your target group.
From what I see in your screenshots, most of your 5xx responses come from either the server or the application (you can obviously tell better which one is the culprit, since you have access to their logs).
To answer your question about 5xx errors coming from the LB itself: these happen right after the LB kicks out an unhealthy instance. If there is no healthy instance to replace it (which shouldn't be the case, because you are supposed to have an Auto Scaling group if you enable evaluation of target health on the LB), the LB cannot produce a meaningful response and falls back to a 5xx.
This should be enough information for you to make adjustments and investigate the logs.
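One way to tell load-balancer-generated 5xx apart from target-generated 5xx is to compare the two CloudWatch metrics; a small boto3 sketch (the LoadBalancer dimension value is a placeholder):

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")

# Dimension value is a placeholder in the form "app/<name>/<id>".
LB_DIMENSION = "app/my-alb/1234567890abcdef"

def five_xx_sum(metric_name):
    """Sum a 5xx metric over the last hour in 5-minute buckets."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/ApplicationELB",
        MetricName=metric_name,
        Dimensions=[{"Name": "LoadBalancer", "Value": LB_DIMENSION}],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in resp["Datapoints"])

# 5xx generated by the ALB itself vs. 5xx returned by the targets.
print("ELB 5xx:   ", five_xx_sum("HTTPCode_ELB_5XX_Count"))
print("Target 5xx:", five_xx_sum("HTTPCode_Target_5XX_Count"))
```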

EC2 Instance giving better output than ELB

We have a cluster of 3 EC2 instances. A single EC2 instance is able to serve a load of around 500 users on the application, but when the same EC2 instance is put behind the ELB it does not even serve 250 users. We drilled down further and applied the configuration below at the different ends:
Optimized code to respond in less time.
The ELB is set with a 300-second timeout for all responses and for the healthy/unhealthy checks.
Apache on the EC2 instances is set with a timeout of 600 seconds and KeepAlive set to On.
The ELB is routing requests with equal distribution.
Every time we hit it with a higher load (500 users on the cluster), we end up getting some failures with 504 Gateway Timeout errors. Kindly help with a solution to get more optimal output.
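For reference, the timeout settings described above could be applied like this with boto3, assuming a Classic Load Balancer (the name is a placeholder); the Apache side is shown only as a comment:

```python
import boto3

elb = boto3.client("elb")  # Classic Load Balancer API

# Placeholder name for the Classic ELB described above.
elb.modify_load_balancer_attributes(
    LoadBalancerName="my-classic-elb",
    LoadBalancerAttributes={"ConnectionSettings": {"IdleTimeout": 300}},
)

# On the EC2 instances, the Apache settings described above would be:
#   Timeout 600
#   KeepAlive On
```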

Is there a way to let the LB route traffic to a host once it becomes healthy?

I have two EC2 instances behind an Application Load Balancer, and I'm trying to update the application host by host (a rolling update). To do that I follow these steps:
1. stop the nginx service
2. update the application
3. start the nginx service
By stopping the nginx service, the LB marks the host as unhealthy and routes traffic to the other host; then I do the same thing on the second host.
The problem is that after I start the nginx service and the LB marks the host as healthy, it only starts routing traffic to this host again after some time (4 minutes on average).
Because of that I have critical downtime.
(Screenshot: LB settings)
Is there a way to let the LB route traffic to the host as soon as it becomes healthy?
Update: Based on the chat discussion, the cause of the issue for now is unknown. Further troubleshooting will resume soon though.
One way would be to customize your health check. You haven't included your ELB health check settings, so I will assume you are using the defaults.
By default, the Classic Load Balancer uses a Healthy Threshold of 10 checks and a HealthCheck Interval of 30 seconds.
That basically works out to 5 minutes (10 checks × 30 seconds = 300 seconds), so your instances need about 5 minutes of passing checks to be considered healthy by the load balancer.
Adjusting these settings should reduce the time you observe.
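Since the question mentions an Application Load Balancer, here is a hedged sketch of tightening the corresponding target group settings with boto3 (the ARN and the exact values are placeholders); for a Classic Load Balancer the equivalent call would be configure_health_check:

```python
import boto3

elbv2 = boto3.client("elbv2")

# Fewer, more frequent successful checks => the target is marked
# healthy (and receives traffic) sooner after nginx comes back up.
elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/my-tg/...",  # placeholder
    HealthCheckIntervalSeconds=10,
    HealthyThresholdCount=2,
)
```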

ECS Fargate + Network Load Balancer Healthcheck

I'm experiencing an issue with the following setup:
API Gateway -> VPC Link -> Private NLB -> Target Group -> AWS ECS Fargate
If I set up the NLB's health check as TCP/HTTP on a specified endpoint, that endpoint gets hammered to death with internal requests (no requests are coming through the API Gateway; I checked).
My problem with this behaviour, other than having the health endpoint spammed by my own architecture, is that the application's functionality is suffering (I keep getting slow responses on about 1 out of 4 GET requests to the API).
I tried changing the health check to TCP only: same slow responses.
I tried temporarily switching to a public ALB; I get double health checks, separated by 30 seconds, but my application responds in an average of 100 ms.
So, as an example of what I mean by "double health-checks":
Health Check 1.1 at 00:00:00
Health Check 2.1 at 00:00:10
Health Check 1.2 at 00:00:30
Health Check 2.2 at 00:00:40
Any ideas?
TL;DR:
Enable the "Cross-Zone Load Balancing" NLB flag.
The issue was that the "Cross-Zone Load Balancing" option was not checked.
It seems that when a request is processed by an NLB node that resides in a different AZ from the target it is trying to reach, the node first tries to resolve the target IP within its own AZ; if that fails, it hands the request over to another NLB node in the appropriate AZ, which is able to resolve it and thus reach the target.
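A minimal boto3 sketch of enabling that flag on the NLB (the load balancer ARN is a placeholder):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Enable cross-zone load balancing so any NLB node can forward
# requests to healthy targets in any Availability Zone.
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:...:loadbalancer/net/my-nlb/...",  # placeholder
    Attributes=[{"Key": "load_balancing.cross_zone.enabled", "Value": "true"}],
)
```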

ELB always reports instances as InService

I am using an AWS ELB to report the status of my instances to an Auto Scaling group, so that a non-functional instance is terminated and replaced by a new one. The ELB is configured to ping TCP:3000 every 60 seconds, with a timeout of 10 seconds before a check is considered a failure; the unhealthy threshold is 5 consecutive failed checks.
However, the ELB always reports my instances as healthy and InService, even though I periodically come across an instance that is timing out, and I have to terminate it manually and launch a new one while the ELB still reports it as InService.
Why does this happen?
After investigating a little, I found the cause.
I was trying to assess the health of the app through an API call to a web app running on the instance, waiting for the response to time out to declare the instance faulty. A TCP health check on port 3000 only verifies that the port accepts connections, so it never caught this. I needed to use HTTP as the health check protocol, calling port 3000 with a custom path, instead of TCP.
Note: the API needs to return a status code of 200 for the load balancer to consider the instance healthy. It now works perfectly.
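A sketch of that change with boto3, assuming a Classic Load Balancer (the name, the /health path, and the healthy threshold are placeholders; the endpoint must return HTTP 200 to pass):

```python
import boto3

elb = boto3.client("elb")  # Classic Load Balancer API

# Switch from a TCP check (which only verifies the port accepts
# connections) to an HTTP check against an application endpoint.
elb.configure_health_check(
    LoadBalancerName="my-classic-elb",        # placeholder
    HealthCheck={
        "Target": "HTTP:3000/health",         # placeholder path; must return 200
        "Interval": 60,
        "Timeout": 10,
        "UnhealthyThreshold": 5,
        "HealthyThreshold": 2,                # assumed value
    },
)
```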