Route 53 failover routing takes downtime of 6 to 8 minutes

I am getting "502 bad gateway error" between switching regions of Route 53 Failover.
Switching between primary to secondary takes 2-3 minutes if primary is down.
Meanwhile when on DR site IF primary comes up It will takes another 6 to 8 minutes for redirecting traffic from DR to primary. How to completely minimizes downtime from 6 to 8 minutes to 0?

You need to check how long it takes your ELB health check plus your Route 53 health check to determine that a failover is required; the final step is the TTL of the DNS records.
For example, let's say you have a web application hosted behind an ELB, and you are accessing it via myapp.mydomain.com.
ELB Health Check
While the primary thing you should check is the R53 health check (see below), the ELB configuration is also important.
Look at how long it should take to determine failure:
HealthCheck Interval - The amount of time between health checks
Unhealthy Threshold - How many consecutive failed health checks it takes to mark an instance as out of service
Make sure this configuration is the same for the ELBs in both regions.
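If your regional endpoints are behind an ALB/NLB target group, a minimal boto3 sketch of where those two settings live might look like this (the target group ARN and values are placeholders; apply the same configuration in both regions):

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

# Placeholder ARN -- substitute the target group in each region.
elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/myapp/0123456789abcdef",
    HealthCheckIntervalSeconds=10,   # HealthCheck Interval: time between checks
    UnhealthyThresholdCount=2,       # Unhealthy Threshold: consecutive failures before out-of-service
    HealthCheckTimeoutSeconds=5,
    HealthCheckPath="/",
)
# Worst case, failure detection takes roughly Interval x UnhealthyThreshold (about 20 seconds here).
```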
Route53 Health Check
This is the main thing that will determine how long failover takes.
You probably have two CNAME records for myapp.mydomain.com, each associated with a Route 53 health check, and each health check points at the ELB in its respective region.
Check both health checks and make sure:
Request interval - How often Route 53 polls your ELB for its health.
Failure threshold - The number of consecutive health checks that an endpoint must pass or fail for its status to change.
Make sure both health checks (primary and secondary) have the same configuration.
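As a rough illustration, this is where those two settings go when creating the health checks with boto3 (the ELB DNS name is a placeholder; create one health check per region with identical values):

```python
import uuid
import boto3

route53 = boto3.client("route53")

resp = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),   # must be unique per request
    HealthCheckConfig={
        "Type": "HTTP",
        "FullyQualifiedDomainName": "primary-elb-1234567890.us-east-1.elb.amazonaws.com",  # placeholder
        "Port": 80,
        "ResourcePath": "/",
        "RequestInterval": 10,   # Request interval: 10 or 30 seconds
        "FailureThreshold": 3,   # Failure threshold: consecutive checks needed to change status
    },
)
primary_health_check_id = resp["HealthCheck"]["Id"]
```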
Once the status changes, it's up to the DNS record TTL.
Route53 CNAME TTL
Check how long resolvers will keep using the old record after a failover by looking at the record TTL. For example, if the TTL is 30, it will take approximately 30 seconds for cached answers to expire and for traffic to start reaching the secondary region.
Make sure both CNAME records have the same TTL.
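Putting it together, a sketch of the two failover CNAME records with the same 30-second TTL (the hosted zone ID, ELB names, and health check IDs are placeholders):

```python
import boto3

route53 = boto3.client("route53")

# Placeholder IDs -- use the values returned by create_health_check.
primary_health_check_id = "11111111-1111-1111-1111-111111111111"
secondary_health_check_id = "22222222-2222-2222-2222-222222222222"

route53.change_resource_record_sets(
    HostedZoneId="Z123EXAMPLE",   # placeholder hosted zone
    ChangeBatch={"Changes": [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "myapp.mydomain.com",
                "Type": "CNAME",
                "SetIdentifier": "primary",
                "Failover": "PRIMARY",
                "TTL": 30,
                "HealthCheckId": primary_health_check_id,
                "ResourceRecords": [{"Value": "primary-elb-1234567890.us-east-1.elb.amazonaws.com"}],
            },
        },
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "myapp.mydomain.com",
                "Type": "CNAME",
                "SetIdentifier": "secondary",
                "Failover": "SECONDARY",
                "TTL": 30,
                "HealthCheckId": secondary_health_check_id,
                "ResourceRecords": [{"Value": "secondary-elb-1234567890.us-west-2.elb.amazonaws.com"}],
            },
        },
    ]},
)
```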
After working through this you can estimate how long a failover should take. For example:
Your health checks are looking at port 80 (path /), your health checks take approx. 30 seconds to detect a failure, and Apache dies on the primary site.
Within those (example) 30 seconds the ELB determines the instances are out of service and stops forwarding traffic to them.
Within the same (example) 30 seconds the Route 53 health check, which monitors the same endpoint (port 80, path /), also determines that the primary ELB is unhealthy.
This is the point at which Route 53 starts answering DNS queries with your secondary ELB.
If your TTL is set to 30, the failover should complete in approx. 1 minute, +/- some time for propagation, etc.
Make sure not to set your health checks too frequent: depending on how many instances are behind your ELB, frequent checks can generate a lot of requests to your service's health endpoint from both the ELB and Route 53.

Related

What happens when a health check fails in a weighted policy on Route 53?

I am wondering what happens when I have a weighted routing policy set up in my Route 53 pointing at two different EC2 instances or ALBs and the health check fails for one of them.
What happens to the traffic that was going to the failed weighted target?
Will it fail, or will Route 53 send that weight to the healthy targets as well, exceeding the weight originally assigned to them?
TL;DR: Route 53 will ignore the unhealthy endpoint and recalculate the weights.
For example, if you have 3 endpoints with a weight 1 for each of them. Normally, you will get 1/3 of the total traffic on each endpoint. If one of them fails the health check, then the remaining two healthy endpoints will each receive 1/2 of the total traffic.
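A toy illustration of that recalculation (plain Python with made-up endpoints), showing how the share of traffic shifts once the unhealthy endpoint is ignored:

```python
# Hypothetical endpoints: equal weight of 1, with one of them failing its health check.
endpoints = [
    {"id": "a", "weight": 1, "healthy": True},
    {"id": "b", "weight": 1, "healthy": True},
    {"id": "c", "weight": 1, "healthy": False},
]

healthy = [e for e in endpoints if e["healthy"]]
total_weight = sum(e["weight"] for e in healthy)

for e in healthy:
    # With all three healthy each would get 1/3; with "c" unhealthy, "a" and "b" get 1/2 each.
    print(e["id"], e["weight"] / total_weight)
```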

Application ELB routes traffic to newly added instance before grace period

I have set up an Auto Scaling group with a health check grace period of 300 seconds (5 minutes). My new instances take at most 2.5 minutes to boot and be ready to handle HTTP requests. But I am noticing that each time a new instance is added, the ELB starts forwarding traffic to it well before the 5-minute grace period ends, and because of this I am getting 502 Bad Gateway errors.
Can anyone explain why my Application Load Balancer is behaving like this?
I am using ELB-type health checks, and below are the settings of my target group health check:
Protocol : HTTP
Port : 80
Healthy threshold : 2
Unhealthy threshold : 10
Timeout : 10
Interval : 150
Success codes : 200
This is normal behavior. The grace period is not there to prevent health checks from happening, and this holds true for both ELB and EC2 health checks: during the grace period you specify, both the ELB and the EC2 service will still send health checks to your instance. The difference is that Auto Scaling will not act upon the results of these checks, which means it will not automatically schedule the instance for replacement.
Only after the instance is up and running correctly (it has passed the ELB and EC2 health checks) will the ELB register the instance and start sending normal traffic to it, and this can happen before the grace period expires. If you see 502 errors after the instance has been registered with the ELB, then your problem is somewhere else.
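For reference, a small boto3 sketch (with a hypothetical group name) of the two Auto Scaling settings this answer is about; the grace period only delays Auto Scaling acting on failed checks, it does not delay ELB registration:

```python
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName="myapp-asg",   # hypothetical group name
    HealthCheckType="ELB",              # replace instances that fail the ELB health check
    HealthCheckGracePeriod=300,         # seconds during which Auto Scaling ignores failed checks
)
# The ELB still runs its own health checks and registers the instance as soon as it passes,
# which can happen well before the 300 seconds are up.
```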
I finally resolved my issue and am writing up my solution here to help anyone else facing the same problem.
My initial feeling was that the Application Load Balancer was routing traffic to the newly added instance before it was ready to serve, but detailed investigation showed that was not the issue. The new instance was able to serve traffic from the start; after a few minutes it would generate this ELB-level 502 error for around 30 seconds and then start working normally again.
Solution:
The Application Load Balancer has a default connection keep-alive (idle timeout) of 60 seconds, while Apache2 has a default connection KeepAlive of 5 seconds. Once the 5 seconds are over, Apache2 closes its connection and resets the connection with the ELB. However, if a request comes in at precisely the wrong time, the ELB accepts it and decides which host to forward it to, and in that moment Apache closes the connection. This results in said 502 error code.
I set the ELB timeout to 60 seconds and the Apache2 timeout to 120 seconds. This solved my problem.
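A hedged sketch of that fix (boto3 with a placeholder load balancer ARN; the Apache directives are shown as comments): keep the backend keep-alive longer than the load balancer idle timeout.

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN -- substitute your own Application Load Balancer.
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/myapp/0123456789abcdef",
    Attributes=[{"Key": "idle_timeout.timeout_seconds", "Value": "60"}],
)

# On the Apache side (e.g. apache2.conf), keep connections open longer than the ALB does:
#   KeepAlive On
#   KeepAliveTimeout 120
```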

AWS - Does Elastic Load Balancing actually prevent LOAD BALANCER failover?

I've taken this straight from some AWS documentation:
"As traffic to your application changes over time, Elastic Load Balancing scales your load balancer and updates the DNS entry. Note that the DNS entry also specifies the time-to-live (TTL) as 60 seconds, which ensures that the IP addresses can be remapped quickly in response to changing traffic."
Two questions:
1) I was under the impression originally that a single static IP address would be mapped to multiple instances of an AWS load balancer, providing fault tolerance at the balancer level: if one machine crashed for whatever reason, the static IP address registered to my domain name would simply be moved dynamically to another balancer instance and continue serving requests. Is this wrong? Based on the quote above from AWS, it seems the only magic happening here is that AWS's DNS servers hold multiple A records for your AWS-registered domain name, and after the 60-second TTL expires, Amazon's DNS entry is updated to return only the active IPs. That still leaves up to 60 seconds of failed connections on the client side. True or false? And why?
2) If the above is true, would it be functionally equivalent if I were using a hosting provider such as GoDaddy, entered multiple A records, and set the TTL to 60 seconds?
Thanks!
The ELB is assigned a DNS name which you can then assign to an A record as an alias; see here. If you have your ELB set up with multiple instances, you define the health check: you determine what path is checked, how often, and how many failures indicate an instance is down (for example, check / every 10s with a 5s timeout, and if it fails 2 times in a row consider it unhealthy). When an instance becomes unhealthy, the remaining instances still serve requests without delay. If the instance returns to a healthy state (for example, it passes 2 checks in a row), it returns to service as a healthy host in the load balancer.
What the quote is referring to is the load balancer itself. In the event that it has an issue or an AZ becomes unavailable, the quote describes what happens to the underlying ELB DNS record, not the alias record you assign to it.
Whether or not traffic is affected depends partly on how sessions are handled in your setup: whether they are sticky or handled by another system such as ElastiCache or your database.
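As an aside, a minimal sketch (boto3, with placeholder zone and ELB values) of the alias A record mentioned above, which lets your own domain follow the ELB's changing IPs without managing A records yourself:

```python
import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z123EXAMPLE",   # your hosted zone (placeholder)
    ChangeBatch={"Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "myapp.mydomain.com",
            "Type": "A",
            "AliasTarget": {
                # The ELB's own canonical hosted zone ID and DNS name (placeholder values).
                "HostedZoneId": "Z35SXDOTRQ7X7K",
                "DNSName": "myapp-elb-1234567890.us-east-1.elb.amazonaws.com",
                "EvaluateTargetHealth": True,
            },
        },
    }]},
)
```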

Customizing/Architecting AWS ELB to have Zero Downtime

So the other day we faced an issue where one of the instances behind our Application Load Balancer failed its instance status check and system status check. It took about 10 seconds (the minimum we can get) for our ELB to detect this and mark the instance as "unhealthy", but we lost some amount of traffic in those 10 seconds because the ELB kept routing traffic to the unhealthy instance. Is there a solution where we can avoid literally any downtime, or am I being too unrealistic?
I'm sure this isn't the answer you want to hear, but if 10 seconds of traffic loss is not tolerable, you'll need to implement your own health check / load balancing solution. My organization has systems where packet loss is unacceptable as well, and that's what we needed to do.
This solution is twofold.
You need to implement your own load-balancing infrastructure. We chose to use Route 53 weighted record sets (TTL of 1 second; we'll get back to this) with equal weight for each server.
Launch an ECS container instance per load-balanced EC2 instance whose sole purpose is to run health checks. It performs both DNS and IP health checks (using the requests library in Python) and adds or removes the Route 53 weighted record in real time as it detects an issue.
In our testing, however, we discovered that while the upstream DNS servers honor Route 53's 1-second TTL when a record is removed, they "blacklist" that record (FQDN + IP combination) from coming back for up to 10 minutes (we saw re-resolution times varying from 1 to 10 minutes). So you'll be able to fail over quickly, but you must take into account that it can take up to 10 minutes for the re-addition of the record to be honored.
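A rough sketch of what each per-instance checker could do (boto3 + requests, with hypothetical names and IPs; not the exact implementation described above): UPSERT the 1-second-TTL weighted record while its instance is healthy, DELETE it otherwise.

```python
import boto3
import requests

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z123EXAMPLE"   # placeholder
RECORD_NAME = "myapp.mydomain.com"
INSTANCE_IP = "10.0.1.23"        # the instance this checker is responsible for


def instance_is_healthy(ip):
    """IP-level health check; the setup described above also does a DNS-level check."""
    try:
        return requests.get(f"http://{ip}/", timeout=2).status_code == 200
    except requests.RequestException:
        return False


def set_weighted_record(action, ip):
    """UPSERT or DELETE this instance's weighted A record (DELETE fails if the record is already gone)."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Changes": [{
            "Action": action,
            "ResourceRecordSet": {
                "Name": RECORD_NAME,
                "Type": "A",
                "SetIdentifier": f"server-{ip}",
                "Weight": 1,
                "TTL": 1,   # the 1-second TTL mentioned above
                "ResourceRecords": [{"Value": ip}],
            },
        }]},
    )


set_weighted_record("UPSERT" if instance_is_healthy(INSTANCE_IP) else "DELETE", INSTANCE_IP)
```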

Amazon ELB health check: the interval before the second attempt after the first attempt failed

From Amazon docs:
"Your load balancer to send requests to http://<node IP address>:80/index.html every 5 seconds. Allow 3 seconds for the web server to respond. If the load balancer does not get any response after 2 attempts, take the node out of service."
Does a load balancer wait for 5 seconds after the first request failed or does it perform the second request immediately?
In my experience, the health check interval is independent of whether previous health checks failed or succeeded. So if you have your health check interval set to 5 seconds, you will see checks every 5 seconds in your logs.
That page has an oddly worded description of load balancer health checks, and I can't seem to find more detailed information about the health check configuration settings anywhere in the documentation.
When configuring an ELB there are a few settings that the health check depends on:
Ping target - The ELB pings this target over the configured protocol and expects a 200 status.
Timeout - How long the ELB waits for a response; the ELB closes the connection if it is idle for longer than the timeout setting.
Interval - How often the ELB performs the health check against the instances.
Unhealthy threshold - The number of consecutive failures after which the ELB marks the instance unhealthy and takes it out of service (which does not mean it is terminated).
Healthy threshold - The number of consecutive successes after which the ELB marks the instance healthy again and puts it back in service.
So if the ELB receives an unhealthy response from an instance for the first time, it will run the health check again after the configured interval, and it will not mark the instance unhealthy until the unhealthy threshold has been crossed.
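For a classic ELB, those five settings map onto a single API call; a minimal boto3 sketch with a hypothetical load balancer name:

```python
import boto3

elb = boto3.client("elb")   # classic Elastic Load Balancing

elb.configure_health_check(
    LoadBalancerName="my-classic-elb",   # hypothetical name
    HealthCheck={
        "Target": "HTTP:80/index.html",  # ping target
        "Timeout": 3,                    # seconds to wait for a response
        "Interval": 5,                   # seconds between checks
        "UnhealthyThreshold": 2,         # consecutive failures -> OutOfService
        "HealthyThreshold": 2,           # consecutive successes -> InService
    },
)
```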
Two AWS target group health check gotchas:
The interval is respected even if the previous health check failed
The timeout is not part of the interval
If you have this setup
Healthy threshold 5 consecutive health check successes
Unhealthy threshold 3 consecutive health check failures
Interval 60 seconds
Timeout 5 seconds
The instance will be declared unhealthy 3:05 minutes after the first failed check (3 x 60 seconds + 5 seconds).
The instance will be declared healthy 5 minutes after the first successful check (5 x 60 seconds).
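A quick reproduction of that arithmetic (plain Python, taking the formulas above at face value):

```python
interval_s = 60
timeout_s = 5
unhealthy_threshold = 3
healthy_threshold = 5

# The interval keeps running even after a failure, and the timeout sits on top of it.
time_to_unhealthy = unhealthy_threshold * interval_s + timeout_s   # 185 s, i.e. 3:05
time_to_healthy = healthy_threshold * interval_s                   # 300 s, i.e. 5:00

print(time_to_unhealthy, time_to_healthy)
```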