Health check in aws route53 - amazon-web-services

Q1. When creating a record with the failover routing policy in a Route 53 hosted zone, what is the difference between "Evaluate Target Health" and "Associate with Health Check"?
Q2. Does Route 53 perform a health check for each DNS request it receives?

Both serve the same purpose: they tell Route 53 whether a record is healthy enough to be returned. The key difference is that Evaluate Target Health applies to alias records, e.g. one pointing at a load balancer's DNS endpoint, while Associate with Health Check is typically used with non-alias records, e.g. an A record holding a web server's static IP address.
Evaluate Target Health does not use a health check that you create. Associate with Health Check uses a health check that you create.
Put another way: use Evaluate Target Health for an AWS service that manages its own health status, and use Associate with Health Check for a service that you control and whose health you determine through your own health check.
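To make the distinction concrete, here is a minimal sketch of the two kinds of failover record sets, written as plain Python dictionaries in the shape that boto3's `change_resource_record_sets` expects inside `ChangeBatch`. The domain, zone ID, load balancer DNS name, IP address, and health check ID are placeholders, and the helper names are made up for illustration.

```python
def alias_failover_record(name, role, set_id, alias_zone_id, alias_dns_name):
    """Alias record: Route 53 evaluates the health of the alias target itself."""
    return {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "AliasTarget": {
            "HostedZoneId": alias_zone_id,  # hosted zone of the target (e.g. the ELB), not your own zone
            "DNSName": alias_dns_name,
            "EvaluateTargetHealth": True,
        },
        # Alias records carry no TTL and no ResourceRecords.
    }


def ip_failover_record(name, role, set_id, ip_address, health_check_id, ttl=60):
    """Non-alias record: health comes from a health check that you created."""
    return {
        "Name": name,
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,
        "TTL": ttl,
        "ResourceRecords": [{"Value": ip_address}],
        "HealthCheckId": health_check_id,
    }


# Placeholder values, for illustration only.
primary = alias_failover_record(
    "app.example.com.", "PRIMARY", "primary-lb",
    "Z0000000EXAMPLE", "my-lb-123.us-east-1.elb.amazonaws.com.",
)
secondary = ip_failover_record(
    "app.example.com.", "SECONDARY", "secondary-ip",
    "192.0.2.10", "00000000-0000-0000-0000-000000000000",
)
```

Each dictionary would go into a change as `{"Action": "UPSERT", "ResourceRecordSet": primary}`.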

John Hanley has already answered your first question. The answer to your second question is no.
According to the AWS documentation:
"If you associated a health check with a non-alias record, Route 53 checks the current status of the health check.
Route 53 periodically checks the health of the endpoint that is specified in a health check; it doesn't perform the health check when the DNS query arrives."
I hope that answers your question :)
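The behavior described in that quote can be pictured with a toy model. This is only an illustration, not how Route 53 is implemented: the class name and methods are invented, and it ignores failure thresholds and the fact that many checkers probe in parallel. The point is that a DNS query reads the last stored status and never triggers a probe.

```python
class HealthCheckModel:
    """Toy model: status is refreshed on a schedule; DNS queries only read it."""

    def __init__(self, probe):
        self._probe = probe      # callable returning True when the endpoint is healthy
        self.status = True       # assumption: treated as healthy until the first check runs
        self.probe_count = 0

    def run_scheduled_check(self):
        """Called periodically, independent of any DNS traffic."""
        self.probe_count += 1
        self.status = self._probe()

    def answer_query(self, primary_ip, secondary_ip):
        """No probe happens here; the stored status decides the answer."""
        return primary_ip if self.status else secondary_ip
```

With this model, a thousand queries between two scheduled checks cause zero extra probes, and an endpoint that dies between checks keeps being returned until the next scheduled check notices.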

Related

Route53 health check shows OK while the endpoint is down

I'm not sure how this is possible: I set up a Route 53 health check with email alerting for when our endpoint goes down, yet the check still shows OK.
The endpoint is definitely down, because the EC2 instance hosting it is powered off.
❯ telnet foo.io 443
Trying 18.18.18.18...
telnet: connect to address 18.18.18.18: Operation timed out
telnet: Unable to connect to remote host
Is it possible that the checker has cached something? We don't have anything in between, so it should be hitting the EC2 instance directly.

I think you have left your health check disabled. This is what the documentation says about disabling a health check:
"Stops Route 53 from performing health checks. When you disable a health check, Route 53 stops aggregating the status of the referenced health checks.
After you disable a health check, Route 53 considers the status of the health check to always be healthy. If you configured DNS failover, Route 53 continues to route traffic to the corresponding resources."
That may be why you see it passing.
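One way to confirm this is to look at the `Disabled` flag in the health check's configuration. The sketch below assumes the response shape of boto3's `get_health_check` call; the helper name and the sample response are invented for illustration, and the actual API call is left as a comment because it needs credentials and a real health check ID.

```python
def is_health_check_disabled(get_health_check_response):
    """Return True if the health check in the response is disabled."""
    config = get_health_check_response["HealthCheck"]["HealthCheckConfig"]
    # Treat a missing flag as "not disabled".
    return bool(config.get("Disabled", False))


# Real usage (requires boto3 and AWS credentials):
#   import boto3
#   route53 = boto3.client("route53")
#   response = route53.get_health_check(HealthCheckId="<your-health-check-id>")
#   print(is_health_check_disabled(response))

# Trimmed, made-up response for illustration.
sample_response = {
    "HealthCheck": {
        "Id": "<your-health-check-id>",
        "HealthCheckConfig": {"Type": "HTTPS", "Port": 443, "Disabled": True},
    }
}
print(is_health_check_disabled(sample_response))
```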

ELB health check history / log

I have an ELB (a Network Load Balancer with a couple of Auto Scaling groups as its target group) that has periodic health check failures, i.e. some instances are marked unhealthy and then recover after a few minutes. The health check is a simple static page (/health_check).
The failures seem to coincide with heavy network load on the host (downloading large files from S3), but I want more information, e.g. whether the instances are failing the active or the passive health check described in https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-health-checks.html.
However, I am not able to find the health check history or logs from the ELB. All my searches turn up is the ELB access log (https://docs.aws.amazon.com/elasticloadbalancing/latest/network/load-balancer-access-logs.html), which covers the actual user requests.
Is this health check history / log accessible anywhere?

AWS Health Check Restart API

I have an AWS load balancer with 2 EC2 instances serving an API written in Python.
If 10K requests come in at the same time as the AWS health check, the health check fails, and clients get 502/504 gateway errors because the instances restart due to the failed health check.
I checked the instances: CPU usage peaks at 30% and memory at 25%.
What's the best option to fix this?

A few things to consider here:
Keep the health check API fairly light, but ensure that the health check API/URL indeed returns correct responses based on the health of the app.
You can configure the health check to mark the instance as failed only after X failed checks. You can tune this parameter and the Health check frequency to match your needs.
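As a sketch of the first point, here is a lightweight health endpoint built with Python's standard library. The `/health` path, the `app_is_healthy` hook, and the default port are assumptions for illustration; a real API would expose an equivalent route in whatever framework it already uses.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def app_is_healthy():
    """Hypothetical hook: replace with a cheap check of what the app depends on."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        healthy = app_is_healthy()
        body = b"ok" if healthy else b"unhealthy"
        # 200 tells the load balancer the target is fine; 503 marks it as failing.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep frequent health probes out of the request log


def make_health_server(host="127.0.0.1", port=8080):
    """Build the server; bind to 0.0.0.0 so a load balancer can reach it."""
    return ThreadingHTTPServer((host, port), HealthHandler)


# To run it: make_health_server("0.0.0.0", 8080).serve_forever()
```

The threading server handles each probe in its own thread, so a health check is not queued behind slow API requests in this sketch.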

You can stop instances from being replaced after a failed ELB health check by setting your Auto Scaling group's health check type to EC2. This will prevent instances from being terminated due to a failed ELB health check.

Route 53 failover routing takes downtime of 6 to 8 minutes

I am getting a "502 Bad Gateway" error while Route 53 failover switches between regions.
Switching from primary to secondary takes 2-3 minutes when the primary goes down.
Then, while traffic is on the DR site, if the primary comes back up it takes another 6 to 8 minutes for traffic to be redirected from DR to primary. How can I reduce this downtime from 6 to 8 minutes to 0?

You need to check how long it takes your ELB health check and your Route 53 health checks to determine that a failover is required. The final step is the TTL of the DNS records.
For example, let's say you have a web application hosted behind an ELB, and you are accessing it via myapp.mydomain.com.
ELB Health Check
While the primary thing you should check is the R53 health check (see below), the ELB configuration is also important.
Look at how long it should take to determine failure:
HealthCheck Interval - The amount of time between health checks
Unhealthy Threshold - How many failed health checks
Make sure this configuration is the same in ELBs in both regions.
Route53 Health Check
This is the main thing that will determine how long failover takes.
You probably have 2 CNAME records for myapp.mydomain.com, each associated with an R53 health check, and each health check points at the ELB in its respective region.
Check both health checks and make sure:
Request interval - How often R53 will poll your ELB for its health.
Failure threshold - The number of consecutive health checks that an endpoint must pass or fail for the status to change.
Make sure the configuration of both health checks (primary and secondary) is the same.
Once the status changes, it's up to the DNS record TTL.
Route53 CNAME TTL
Check how long resolvers will keep using a cached answer after a failover by looking at the record TTL. For example, if the TTL is 30, it can take approx. 30 seconds after Route 53 switches over before clients start reaching the secondary region.
Make sure both CNAME records have the same TTL.
After following this you can determine how long it should take to failover, for example:
Your health checks are looking at port 80:/ availability, your health checks take approx. 30 seconds to detect a failure, and Apache dies on the primary site.
Within 30 (example) seconds the ELB will mark the instances as out of service and stop forwarding traffic.
Within the same 30 (example) seconds the R53 health check, which monitors the same endpoint (port 80:/), will also determine that the primary ELB is unhealthy.
This is where R53 decides to start pointing DNS queries to your secondary ELB.
If your TTL is set to 30, failover should be completed in approx. 1 minute, +/- some time for propagation, etc.
Make sure not to set your health checks to be too frequent. Depending on how many instances are behind your ELB, it can result in a lot of calls to your service's health endpoint from the ELB and Route 53.
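The arithmetic above can be written down as a rough estimate. This is a sketch under simplifying assumptions: detection time is taken as interval times failure threshold, the ELB and Route 53 checks run in parallel (so only the Route 53 detection time is added to the TTL), and resolver behavior and propagation are ignored. The function names and the example values (10-second interval, threshold of 3) are illustrative.

```python
def detection_seconds(interval_s, failure_threshold):
    """Time for a health check to flip: consecutive failures times the interval."""
    return interval_s * failure_threshold


def estimated_failover_seconds(r53_interval_s, r53_failure_threshold, record_ttl_s):
    """Route 53 detection time plus the time cached DNS answers may live."""
    return detection_seconds(r53_interval_s, r53_failure_threshold) + record_ttl_s


# Example matching the text: about 30 s to detect, TTL of 30 s.
print(estimated_failover_seconds(10, 3, 30))  # 60
```

Plugging in your own interval, threshold, and TTL shows which of the three dominates; a 30-second interval with the same threshold and TTL already gives 120 seconds.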

Temporary invert primary <=> secondary on AWS Route53 DNS Failover setup

I have a Route53 DNS Failover setup associated with Health Checks running fine but periodically we need to invert primary <=> secondary just for an hour or two for maintenance.
What is the best practice for this? Is there a simple way to achieve it?
Any input is appreciated!!
Best regards,

I'm not sure if it's best practice but you can do the following.
mysite.example.com failover (primary) (evaluate target health + calculated health check attached) --> site1.example.com (associated with regular hc)
mysite.example.com failover (secondary) (evaluate target health + calculated health check attached) --> site2.example.com (associated with regular hc)
You can create two calculated health checks with no children. Associate one of these health checks with each site's failover record. If the child health check (site1) becomes unhealthy, Route 53 fails over to site2. If you invert the status of the calculated health check on the primary record, Route 53 will likewise fail over to site2. When you are done, uninvert the health check.
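Toggling the inversion can be scripted. The sketch below only builds the keyword arguments for boto3's `update_health_check` call, which accepts `HealthCheckId` and `Inverted`; the helper name and the placeholder ID are invented for illustration, and the actual calls are shown as comments because they need credentials and a real health check.

```python
def invert_health_check_kwargs(health_check_id, inverted):
    """Arguments for route53.update_health_check to set or clear inversion."""
    return {"HealthCheckId": health_check_id, "Inverted": bool(inverted)}


# Real usage (requires boto3 and AWS credentials):
#   import boto3
#   route53 = boto3.client("route53")
#   hc_id = "<primary-calculated-health-check-id>"
#   route53.update_health_check(**invert_health_check_kwargs(hc_id, True))   # start maintenance
#   ... do the maintenance ...
#   route53.update_health_check(**invert_health_check_kwargs(hc_id, False))  # back to normal

print(invert_health_check_kwargs("<primary-calculated-health-check-id>", True))
```

Traffic will still take the health check's detection time plus the record TTL to move in each direction.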
