AWS autoscale ELB status checks grace period - amazon-web-services

I'm running servers in an AWS Auto Scaling group. The running servers are behind a load balancer, and I'm using the ELB to manage the Auto Scaling group's health checks. When servers start and join the Auto Scaling group, they currently join the load balancer immediately.
How much time (i.e. what health check grace period) do I need to wait before letting them join the load balancer?
Should it be only after the servers are in a state of running?
Should it be only after the servers have passed the system and instance status checks?

There are two types of Health Check available for Auto Scaling groups:
EC2 Health Check: This uses the EC2 status check to determine whether the instance is healthy. It only operates at the hypervisor level and cannot see the health of an application running on an instance.
Elastic Load Balancer (ELB) Health Check: This causes the Auto Scaling group to delegate the health check to the Elastic Load Balancer, which is capable of checking a specific HTTP(S) URL. This means it can check that an application is correctly running on an instance.
Given that your system is using an ELB health check, Auto Scaling will trust the results of the ELB health check when determining the health of each EC2 instance. This can be slightly dangerous because, if the instance takes a while to start, the health check could incorrectly mark the instance as Unhealthy. This, in turn, would cause Auto Scaling to terminate the instance and launch a replacement.
To avoid this situation, there is a Health Check Grace Period setting (in seconds) in the Auto Scaling group configuration. This indicates how long Auto Scaling should wait until it starts using the ELB health check (which, in turn, has settings for how often to check and how many checks are required to mark an instance as Healthy/Unhealthy).
So, if your application takes 3 minutes to start, set the Health Check Grace Period to a minimum of 180 seconds (3 minutes). The documentation does not state whether the timing starts from the moment that an instance is marked as "Running" or whether it is when the Status Checks complete, so perform some timing tests to avoid any "bounce" situations.
In fact, I would recommend setting the Health Check Grace Period to a significantly higher value (e.g. double the amount of time required). This will not impact the operation of your system, since a healthy instance will start serving traffic as soon as the ELB health check is satisfied, which is sooner than the Auto Scaling grace period. The worst case is that a genuinely unhealthy instance will be terminated a few minutes later, but this should be a rare occurrence.
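As a sketch, both settings can be applied to the Auto Scaling group via the AWS CLI; the group name and the 360-second value (double a 3-minute startup) are assumptions for illustration:

```shell
# Use the ELB health check and give instances a generous grace period.
# "my-asg" and 360 seconds are placeholder values.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --health-check-type ELB \
  --health-check-grace-period 360
```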

The documentation now states: "The grace period starts after the instance passes the EC2 system status check and instance status check."
So, at least according to the mid-2015 AWS documentation, the answer is "after the servers passed the system and the instance status checks." This is how we've set up our environment, and although I haven't done precise timings it appears to be correct.

If you closely monitor your CloudFormation stack events, you will see a success signal when your ASG is updated.
The time between when the ASG starts updating and when it receives the success signal is the health check grace period.
It is recommended to set the health check grace period to double the application startup time. For example, if your application takes 10 minutes to start, set the health check grace period to 20 minutes.
The reason is that your application might throw some kind of error and go through several retries.

Related

GCP Enable autoscale after startup script finished

I am trying to create an autoscaler for a managed instance group (MIG) in GCP.
My problems:
The autohealing health check is a TCP health check that verifies whether port 8443 is open.
The GCE instance startup script is very long and may take up to 3 hours in extreme situations.
Is there any way to change the GCE instance state to RUNNING only when the startup script has finished? Or another way to make autohealing wait for the startup script to finish before recreating the instance?
Create a health check for autohealing that is more conservative than a load balancing health check: one that looks for a response on port 8443 and can tolerate some failures before it marks VMs as UNHEALTHY and causes them to be recreated. In this example, a VM is marked as healthy if it returns success once, and marked as unhealthy if it fails 3 consecutive times.
a. In the Google Cloud Console, go to the Create a health check page.
b. Give the health check a name, such as example-check.
c. For Protocol, make sure that TCP is selected.
d. For Port, enter 8443.
e. For Check interval, enter 5.
f. For Timeout, enter 5.
g. Set a Healthy threshold to determine how many consecutive successful health checks must be returned before an unhealthy VM is marked as healthy. Enter 1 for this example.
h. Set an Unhealthy threshold to determine how many consecutive unsuccessful health checks must be returned before a healthy VM is marked as unhealthy. Enter 3 for this example.
i. Click Create to create the health check.
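The same health check can be sketched with a single gcloud command instead of the console steps; the name example-check mirrors the example above:

```shell
# Create the conservative TCP autohealing health check from the steps above.
gcloud compute health-checks create tcp example-check \
  --port 8443 \
  --check-interval 5s \
  --timeout 5s \
  --healthy-threshold 1 \
  --unhealthy-threshold 3
```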
Running Instances
To pause autohealing until the MIG is stable, use the wait-until command with the --stable flag.
For example:
gcloud compute instance-groups managed wait-until instance-group-name
--stable
[--zone zone | --region region]
Waiting for group to become stable, current operations: deleting: 4
Waiting for group to become stable, current operations: deleting: 4
...
Group is stable
You can refer to the documentation on setting up a health check and an autohealing policy.

Is it possible to tell AWS network load balancer that an instance is unhealthy without delay?

The AWS Network Load Balancer incurs a 90 second delay to deregister a target, and takes at least 10 seconds to notice that target is unhealthy.
Is there some API call that would inform the load balancer that the target is unhealthy immediately?
You could just deregister the target directly, for example via the CLI or Python SDK.
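A minimal CLI sketch of that direct deregistration; the target group ARN and instance ID are placeholders:

```shell
# Deregister a target immediately instead of waiting for health checks to fail.
aws elbv2 deregister-targets \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef \
  --targets Id=i-0123456789abcdef0
```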
For HTTP health checks, you can decrease the health check interval to a minimum of 5 seconds with an unhealthy threshold of 2. Combined with a deregistration delay of 0, the load balancer will not wait for a delay to elapse, which improves things a bit.
The TCP health check has its own minimum settings that you can tune as well. Deregistration starts when the target becomes unhealthy, so the health check settings play the main role.
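As a sketch, the health check settings can be tightened on the target group via the CLI; the ARN is a placeholder, and the values mirror the 5-second interval with threshold 2 mentioned above:

```shell
# Tighten the health check so unhealthy targets are detected faster.
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef \
  --health-check-interval-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 2
```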
Deregistration Delay
Elastic Load Balancing stops sending requests to targets that are deregistering. By default, Elastic Load Balancing waits 300 seconds before completing the deregistration process, which can help in-flight requests to the target to complete. To change the amount of time that Elastic Load Balancing waits, update the deregistration delay value.
The initial state of a deregistering target is draining. After the deregistration delay elapses, the deregistration process completes and the state of the target is unused. If the target is part of an Auto Scaling group, it can be terminated and replaced.
If a deregistering target has no in-flight requests and no active connections, Elastic Load Balancing immediately completes the deregistration process, without waiting for the deregistration delay to elapse. However, even though target deregistration is complete, the status of the target will be displayed as draining until the deregistration delay elapses.
See the target group deregistration delay documentation.
Workaround:
One of my client projects had the same requirement, so I came up with a custom health check script that performs two tasks:
1. Check the target group's health check.
2. If the target is healthy, return a healthy status to the LB; if the target is unhealthy, deregister the target immediately.
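A minimal sketch of such a script, assuming a local /health endpoint, a placeholder target group ARN, and a placeholder instance ID (the original script is not shown, so this is only an illustration of the two steps):

```shell
#!/usr/bin/env bash
# Hypothetical custom health check: probe the app locally and, on failure,
# deregister this target from the load balancer without waiting.
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef"
TARGET_ID="i-0123456789abcdef0"
HEALTH_URL="http://localhost:8080/health"

if curl --silent --fail --max-time 2 "$HEALTH_URL" > /dev/null; then
  # Healthy: exit 0 so the caller reports a healthy status to the LB.
  exit 0
else
  # Unhealthy: deregister the target immediately instead of waiting
  # for the load balancer's own checks and deregistration delay.
  aws elbv2 deregister-targets \
    --target-group-arn "$TARGET_GROUP_ARN" \
    --targets Id="$TARGET_ID"
  exit 1
fi
```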

AWS Codedeploy BlockTraffic/AllowTraffic durations

I've been using AWS CodeDeploy to push our applications live, but it always takes ages doing the BlockTraffic and AllowTraffic steps. Currently, I have an Application Load Balancer (ALB) with three EC2 nodes initially (behind an Auto Scaling group). So, if I do a CodeDeploy OneAtATime deployment, the whole process takes up to 25 minutes.
The load balancer I'm using with it had connection draining set to 300s, which I thought was the reason for the slowdown. However, I disabled connection draining and got the same results. I then enabled connection draining with a timeout of 5 seconds and still got the same results.
Further, I found out that CodeDeploy depends on the ALB health check settings. According to the AWS documentation:
After an instance is bound to the ALB, CodeDeploy waits for the status of the instance to be healthy ("inService") behind the load balancer. This health check is done by ALB and depends on the health check configuration.
So I tried by setting low timeouts and thresholds for health check settings. Even those changes didn't reduce the deployment time much.
Can someone direct me to a proper solution to speed up the process?
The issue is the deregistration of instances from the AWS target group. You want to update the deregistration_delay.timeout_seconds attribute on the target group; by default it is 300 seconds, which is 5 minutes.
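As a sketch, the attribute can be lowered via the CLI; the target group ARN and the 30-second value are assumptions:

```shell
# Lower the deregistration delay so BlockTraffic completes faster.
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef \
  --attributes Key=deregistration_delay.timeout_seconds,Value=30
```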

AWS Health Check Restart API

I have an AWS load balancer with 2 EC2 instances serving an API in Python.
If 10K requests come in at the same time and the AWS health check runs, the health check fails, and there are 502/504 gateway errors because the instances restart due to the failed health check.
I checked the instances' CPU usage, which maxed at 30%, and memory, which maxed at 25%.
What's the best option to fix this?
A few things to consider here:
Keep the health check API fairly light, but ensure that the health check API/URL indeed returns correct responses based on the health of the app.
You can configure the health check to mark the instance as failed only after X failed checks. You can tune this parameter and the Health check frequency to match your needs.
You can disable the instance replacement caused by failed ELB health checks by setting your Auto Scaling group's health check type to EC2. This will prevent instances from being terminated due to a failed ELB health check.
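As a sketch, the health check type can be switched via the CLI; the group name is an assumption:

```shell
# Use the EC2 status check instead of the ELB health check for the ASG,
# so a failed ELB check no longer triggers instance replacement.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-asg \
  --health-check-type EC2
```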

ELB always reports instances as inservice

I am using an AWS ELB to report the status of my instances to an Auto Scaling group, so that a non-functional instance is terminated and replaced by a new one. The ELB is configured to ping TCP:3000 every 60 seconds and wait for a timeout of 10 seconds to consider a check failed. The unhealthy threshold is 5 consecutive failed checks.
However, the ELB always reports my instances as healthy and InService, even though I periodically come across an instance that is timing out, which I have to terminate manually and replace with a new one.
Why does this happen?
After investigating a little, I found the cause:
I was trying to assess the health of the app through an API call to a web app running on the instance, waiting for the response to time out to declare the instance faulty. But a TCP check only verifies that the port accepts connections; I needed to use HTTP as the health check protocol, calling port 3000 with a custom path through the load balancer, instead of TCP.
Note: the API needs to return a status code of 200 for the load balancer to consider the instance healthy. It now works perfectly.
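A minimal sketch of that change for a Classic ELB; the load balancer name and the /health path are assumptions, while the interval, timeout, and unhealthy threshold mirror the settings described above:

```shell
# Switch the health check from a bare TCP port probe to an HTTP request,
# so the check fails unless the app actually returns a 200 response.
aws elb configure-health-check \
  --load-balancer-name my-elb \
  --health-check Target=HTTP:3000/health,Interval=60,Timeout=10,UnhealthyThreshold=5,HealthyThreshold=2
```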