GCP: Enable autoscaling after startup script finishes - google-cloud-platform

I am trying to create an autoscaler for an instance group manager within GCP.
My problems:
The autohealing health check is a TCP health check which verifies whether port 8443 is open.
The GCE instance startup script is very long and may, in extreme situations, take up to 3 hours.
Is there any way to change the GCE instance state to RUNNING only once the startup script has finished? Or another way to make the autohealer wait for the startup script to finish before recreating the instance?

The autohealing health check is a TCP health check which verifies whether port 8443 is open.
Create a health check for autohealing that is more conservative than a load balancing health check.
Create a health check that looks for a response on port 8443 and that can tolerate some failure before it marks VMs as UNHEALTHY and causes them to be recreated. In this example, a VM is marked as healthy if it returns a successful response once. It is marked as unhealthy if it returns an unsuccessful response 3 consecutive times.
a. In the Google Cloud Console, go to the Create a health check page.
b. Go to Create a health check
c. Give the health check a name, such as example-check.
d. For Protocol, make sure that TCP is selected.
e. For Port, enter 8443.
f. For Check interval, enter 5.
g. For Timeout, enter 5.
h. Set a Healthy threshold to determine how many consecutive successful health checks must be returned before an unhealthy VM is marked as healthy. Enter 1 for this example.
i. Set an Unhealthy threshold to determine how many consecutive unsuccessful health checks must be returned before a healthy VM is marked as unhealthy. Enter 3 for this example.
j. Click Create to create the health check.
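If you prefer the CLI, here is a minimal sketch of the equivalent command, assuming the example-check name and the thresholds from the steps above:

gcloud compute health-checks create tcp example-check \
    --port 8443 \
    --check-interval 5s \
    --timeout 5s \
    --healthy-threshold 1 \
    --unhealthy-threshold 3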
Running Instances
To pause autohealing until the MIG is stable, use the wait-until command with the --stable flag.
For full details, you can check this link.
For example:
gcloud compute instance-groups managed wait-until instance-group-name \
    --stable \
    [--zone zone | --region region]
Waiting for group to become stable, current operations: deleting: 4
Waiting for group to become stable, current operations: deleting: 4
...
Group is stable
You can refer to this link on Setting up a health check and an autohealing policy
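To directly address the long startup script, a MIG also supports an autohealing initial delay, which tells the autohealer not to act on health check results until the instance has had time to boot. A hedged sketch, assuming the MIG is named instance-group-name and using the 3-hour worst case from the question (10800 seconds):

gcloud compute instance-groups managed update instance-group-name \
    --health-check example-check \
    --initial-delay 10800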

Related

Marking a compute instance as busy to prevent disrupting connections

I have a Golang service using TCP running on GCP's compute VMs with autoscaling. When the CPU usage spikes, new instances are created and deployed (as expected), but when the CPU usage settles again, the instances are destroyed. That is fine, and it's entirely reasonable why this is done, but destroying instances does not take the established TCP connections into account and thus disconnects users.
I'd like to keep the VM instances running until the last connection has been closed to prevent disconnecting users. Is there a way to mark the instance as "busy" telling the autoscaler not to remove that instance until it isn't busy? I have implemented health checks but these do not signal the busyness of the instance, only whether the instance is alive or not.
You need to enable Connection Draining on the backend service your auto-scaling group belongs to:
If the group is part of a backend service that has enabled connection draining, it can take up to 60 seconds after the connection draining duration has elapsed before the VM instance is removed or deleted.
Here are the steps on how to achieve this:
Go to the Load balancing page in the Google Cloud Console.
Click the Edit button for your load balancer or create a new load balancer.
Click Backend configuration.
Click Advanced configurations at the bottom of your backend service.
In the Connection draining timeout field, enter a value from 0 to 3600. A setting of 0 disables connection draining.
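Equivalently, a minimal CLI sketch, assuming a global backend service named my-backend-service (swap in your own name and scope):

gcloud compute backend-services update my-backend-service \
    --global \
    --connection-draining-timeout 3600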
Currently you can request a connection draining timeout of up to 3600 s (= 1 hour), which should suffice for your requirements.
see: https://cloud.google.com/compute/docs/autoscaler/understanding-autoscaler-decisions

AWS Codedeploy BlockTraffic/AllowTraffic durations

I've been using AWS CodeDeploy to push our applications live, but it always takes ages on the BlockTraffic and AllowTraffic steps. Currently, I have an application load balancer (ALB) with three EC2 nodes initially (in an autoscaling group). So, if I do a CodeDeploy OneAtATime, the whole process takes up to 25 minutes.
The load balancer I'm using had connection draining set to 300 s, so I thought that was the reason for the drawn-out deployment. However, I disabled connection draining and got the same results. I then enabled connection draining with a timeout of 5 seconds and still got the same results.
Further, I found out that CodeDeploy depends on the ALB health check settings. According to the AWS documentation:
After an instance is bound to the ALB, CodeDeploy waits for the status of the instance to be healthy ("inService") behind the load balancer. This health check is done by ALB and depends on the health check configuration.
So I tried setting low timeouts and thresholds for the health check settings, but even those changes didn't reduce the deployment time much.
Can someone direct me to a proper solution to speed up the process?
The issue is the deregistration of instances from the AWS target group. You want to update the target group's deregistration_delay.timeout_seconds attribute; by default it's 300 s, which is 5 minutes. (The docs can be found here.)
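A hedged sketch of lowering that delay from the CLI, assuming your target group's ARN is in the $TG_ARN shell variable:

aws elbv2 modify-target-group-attributes \
    --target-group-arn "$TG_ARN" \
    --attributes Key=deregistration_delay.timeout_seconds,Value=30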

AWS Health Check Restart API

I have an AWS load balancer with 2 EC2 instances serving an API in Python.
If 10K requests come in at the same time and the AWS health check comes in as well, the health check fails, and there are 502/504 gateway errors because the instances restart due to the failed health check.
I checked the instances' CPU usage, which maxed at 30%, and memory, which maxed at 25%.
What's the best option to fix this?
A few things to consider here:
Keep the health check API fairly light, but ensure that the health check API/URL indeed returns correct responses based on the health of the app.
You can configure the health check to mark the instance as failed only after X failed checks. You can tune this parameter and the health check frequency to match your needs.
You can disable the EC2 restart caused by a failed ELB health check by configuring your autoscaling group's health check type to EC2. This will prevent instances from being terminated due to a failed ELB health check.
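For the second and third points, a minimal CLI sketch, assuming a hypothetical target group ARN in $TG_ARN and an Auto Scaling group named my-asg:

aws elbv2 modify-target-group \
    --target-group-arn "$TG_ARN" \
    --unhealthy-threshold-count 5 \
    --health-check-interval-seconds 30

aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name my-asg \
    --health-check-type EC2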

ELB always reports instances as inservice

I am using an AWS ELB to report the status of my instances to an Auto Scaling group, so that a non-functional instance is terminated and replaced by a new one. The ELB is configured to ping TCP:3000 every 60 seconds and to wait for a timeout of 10 seconds before counting a health check failure; the unhealthy threshold is 5 consecutive checks.
However, the ELB always reports my instances as healthy and InService, even though I periodically come across an instance that is timing out and have to terminate it manually and launch a new one, with the ELB reporting it as InService the whole time.
Why does this happen?
After investigating a little bit, I found the problem:
I was trying to assess the health of the app through an API call to a web app running on the instance, waiting for the response to time out to declare the instance faulty. A plain TCP check on port 3000 succeeds as long as the port accepts connections, even if the app itself is hung, so the ELB never saw a failure. I needed to use HTTP as the protocol, calling port 3000 with a custom path through the load balancer, instead of TCP.
Note: the API needs to return a status code of 200 for the load balancer to consider the instance healthy. It now works perfectly.
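A hedged CLI sketch of that change for a Classic ELB, assuming a load balancer named my-elb and a hypothetical /health path:

aws elb configure-health-check \
    --load-balancer-name my-elb \
    --health-check Target=HTTP:3000/health,Interval=60,Timeout=10,UnhealthyThreshold=5,HealthyThreshold=2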

AWS autoscale ELB status checks grace period

I'm running servers in an AWS Auto Scaling group. The running servers are behind a load balancer, and I'm using the ELB to manage the Auto Scaling group's health checks. When servers are started and join the Auto Scaling group, they currently join the load balancer immediately.
How much time (i.e. what health check grace period) do I need to wait before letting them join the load balancer?
Should it be only after the servers are in the running state?
Should it be only after the servers have passed both the system and instance status checks?
There are two types of Health Check available for Auto Scaling groups:
EC2 Health Check: This uses the EC2 status check to determine whether the instance is healthy. It only operates at the hypervisor level and cannot see the health of an application running on an instance.
Elastic Load Balancer (ELB) Health Check: This causes the Auto Scaling group to delegate the health check to the Elastic Load Balancer, which is capable of checking a specific HTTP(S) URL. This means it can check that an application is correctly running on an instance.
Given that your system is using an ELB health check, Auto Scaling will trust the results of the ELB health check when determining the health of each EC2 instance. This can be slightly dangerous because, if the instance takes a while to start, the health check could incorrectly mark the instance as Unhealthy. This, in turn, would cause Auto Scaling to terminate the instance and launch a replacement.
To avoid this situation, there is a Health Check Grace Period setting (in seconds) in the Auto Scaling group configuration. This indicates how long Auto Scaling should wait until it starts using the ELB health check (which, in turn, has settings for how often to check and how many checks are required to mark an instance as Healthy/Unhealthy).
So, if your application takes 3 minutes to start, set the Health Check Grace Period to a minimum of 180 seconds (3 minutes). The documentation does not state whether the timing starts from the moment that an instance is marked as "Running" or whether it is when the Status Checks complete, so perform some timing tests to avoid any "bounce" situations.
In fact, I would recommend setting the Health Check Grace Period to a significantly higher value (e.g. double the amount of time required). This will not impact the operation of your system, since a healthy instance will start serving traffic as soon as the ELB health check is satisfied, which is sooner than the Auto Scaling grace period. The worst case is that a genuinely unhealthy instance will be terminated a few minutes later, but this should be a rare occurrence.
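As a concrete sketch of the 3-minute example with the doubling recommendation applied, assuming a hypothetical group named my-asg:

aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name my-asg \
    --health-check-type ELB \
    --health-check-grace-period 360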
The documentation (now) states: "The grace period starts after the instance passes the EC2 system status check and instance status check."
So, at least according to the mid-2015 AWS documentation, the answer is "after the servers have passed the system and the instance status checks." This is how we've set up our environment, and although I haven't done precise timings, it appears to be correct.
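If you want to watch those two checks yourself, a minimal sketch (the instance ID is a placeholder):

aws ec2 describe-instance-status \
    --instance-ids i-0123456789abcdef0 \
    --query 'InstanceStatuses[].[SystemStatus.Status,InstanceStatus.Status]'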
If you closely monitor your CloudFormation stack events, you will see the success signal when your ASG has been updated.
The time difference between when the ASG starts updating and when the ASG receives the success signal is the health check grace period.
It is always recommended to make the health check grace period double the application startup time. For example, if your application takes 10 minutes to start, you should set the health check grace period to 20 minutes.
The reason is that you never know when your application might throw some kind of error and go through several retries.
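A minimal sketch for watching those stack events from the CLI (the stack name is a placeholder):

aws cloudformation describe-stack-events \
    --stack-name my-stack \
    --query 'StackEvents[].[Timestamp,LogicalResourceId,ResourceStatus]'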