Shutdown scripts for Google managed instance groups - google-cloud-platform

I am running a managed instance group in Google Cloud. The instances are behind a load balancer and it is working fine. The problem is that when the managed instance group scales down, the load balancer does not notice until after the instance has been killed, so some requests are sent to an instance that is dead, causing the application to not work properly for a while.
On this page https://cloud.google.com/compute/docs/autoscaler/understanding-autoscaler-decisions I read that shutdown scripts can be used. I tried adding one that tells the instance it is about to be shut down, so that it starts reporting unhealthy when the load balancer does a health check; the script then waits for a while to give the load balancer time to check it. However, it does not seem to work. The script seems to be called, but too late, so the instance just shuts down.
Anyone know how to write a shutdown script for this scenario?
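For what it's worth, a shutdown script along the lines described above might look like the following untested sketch; the flag-file path and the sleep durations are assumptions, and the application's health-check endpoint would need to start returning an unhealthy status once the flag file exists.

#!/bin/bash
# Untested sketch: make the instance fail its health checks, then wait so the
# load balancer has time to notice before the VM is actually terminated.

# Tell the app to start reporting unhealthy (the app must check for this file;
# the path is a placeholder).
touch /var/run/draining

# Wait for several health-check intervals so the load balancer marks the
# instance unhealthy and stops sending it new requests (values are guesses).
sleep 60

# Give in-flight requests a chance to finish before shutdown proceeds.
sleep 30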

It seems this was not the problem after all. After inspecting the logs it turned out that the health checks were timing out, causing the load balancer to not find any nodes in a healthy state.

Related

Why is the ALB Failing Health Checks on a Healthy Target?

I spent quite a lot of time trying to debug this, so I thought I'd post it in case anyone else had the same issue. I was trying to debug an ALB health check issue with Fargate. I could manually connect and see that everything was coming through. I could even connect to the Fargate instance and see that the health checks were coming through and being responded to appropriately. But the ALB kept reporting the health checks as failing.
In this specific case, I am using Tomcat as the server and Fargate as the destination, and the specific error message is "request timed out", but I think other setups (and even other error messages) conform to this case.
It turned out that the only problem was, on my service, I needed to GREATLY increase HealthCheckGracePeriodSeconds. This is the amount of time that the Load Balancer will wait before it starts counting health checks against you.
It turns out that there is quite a bit of latency between what the Load Balancer is doing and what it is reporting to you. By the time I was getting the "request timed out" error, the load balancer had already decided that my machine was failing health checks, but hadn't removed it from the pool yet. So, for me, it looked like it was running correctly, and the Load Balancer was still sending health checks even though it had made the decision to remove the machine from the pool. The latency between when the Load Balancer makes its decisions, when it implements them, and when it reports them caused quite a bit of confusion on my end.
So, if you are having problems with targets being added to your load balancer (in my case, it was a Tomcat server), especially on Fargate, check HealthCheckGracePeriodSeconds to make sure that you are giving it enough time to start all the way up. You can set it to a ridiculously high value to be safe (I think it can go up to 67 years).
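If it helps, the grace period on an ECS/Fargate service can be raised with the AWS CLI; the cluster and service names and the 600-second value below are placeholders, not values from my setup.

# Placeholder cluster/service names; pick a grace period that comfortably
# covers your application's startup time.
aws ecs update-service \
  --cluster my-cluster \
  --service my-tomcat-service \
  --health-check-grace-period-seconds 600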

Slow response when "Unmanaged Instance Group" added to HTTPS Load Balancer

The HTTPS Load Balancer proxy works great with a managed instance group but not with an unmanaged instance group. We have added a few unmanaged instance groups to the backend and have instructed the proxy to direct specific traffic to them, e.g. https://test.example.com to an unmanaged instance group. When testing is done we can stop the instances in the unmanaged instance groups, whereas stopping individual VM instances within a managed group is not possible.
Everything is working as expected. However, the browser takes 10-15 seconds (not always, but mostly) to display the page and randomly receives a 500 error. It seems that either the instances in the unmanaged group are stopped, or the load balancer does some housekeeping that makes it take a long time to respond.
Any help or suggestions to fix the response time would be highly appreciated. Accessing the web server directly, bypassing the load balancer, works as expected, but HTTPS can't be used that way since only the proxy server has the SSL certificate.
I'm taking an educated guess here based on your detailed description of symptoms.
As you noticed, there's something going on "behind the scenes" of the load balancer: either health checks are failing, or some other mechanism that is responsible for "updating" the load balancer when the test backend is shut off is misbehaving.
This shouldn't be happening and it looks like a bug.
At this point I think the best way forward is to report a new issue at Google's Issue Tracker and include a detailed description of what happens. You may link to this question too :)
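In the meantime, a workaround that might be worth trying (just a guess, not a confirmed fix) is to remove a VM from the unmanaged instance group before stopping it, so the backend service stops routing traffic to it; the group, instance and zone names below are placeholders.

# Placeholder names; remove the VM from the group first, then stop it.
gcloud compute instance-groups unmanaged remove-instances my-unmanaged-group \
  --instances=test-vm-1 --zone=us-central1-a
gcloud compute instances stop test-vm-1 --zone=us-central1-a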

Why AWS Classic Load Balancer Instances are OutOfService, but Auto Scale Group Shows InService and Healthy?

I tried to avoid asking this question, but after struggling for hours (or days) and reading all related materials, I'm desperately turning to SO.
So, I am trying to deploy my node/react project to AWS with a (classic) load balancer and auto scaling group. I got all the individual pieces working. Somehow, the instances in the load balancer always show OutOfService, although those instances are InService and Healthy in the auto scaling group. Why this disconnect?
Then, I added an elastic IP to one of the instances. I ssh'd to it and then ran "npm start" manually. Now this instance shows InService and Healthy in the load balancer.
It appears to me that it's not a security group issue, but that the startup script didn't get executed. This is my script:
#!/bin/bash
cd /home/ec2-user/projectname
npm start
Why not?
Some Update:
I enabled access logging for this load balancer, and I got a lot of the same error log entries. Here is one of them:
<Error>
<Code>AccessDenied</Code>
<Message>Access Denied</Message>
<RequestId>BC0FA4BB97BA1557</RequestId>
<HostId>r3wBXZLxJkTzm/SqcQnxEO+f9DhbtCxTLcVAn1vmllj6Dwa0xlO2psP3eEKOiuvNWY/Yb+Gt4C0=</HostId>
</Error>
This is not very helpful to me to figure out where the problem is.
Even more confusing, I also get this kind of error log when the instance has been started manually and is running, and the instance status in the LB is healthy.
What is denied? The health checker? Who is the health checker? These are my health check settings in the load balancer:
Ping Target HTTP:3000/
Timeout 5 seconds
Interval 30 seconds
Unhealthy threshold 2
Healthy threshold 10
The listener has HTTP 80 as load balance port and 3000 as instance port.
UPDATE again:
It appears to me that the real cause of the problem is that the startup script didn't run. I found a few suggestions around this problem, like clearing the /var/lib/cloud folder or adding #cloud-boothook to the top of the startup script, but nothing worked for me.
UPDATE (3):
I couldn't make it work properly after a few days of struggle and am giving up for now. Here is a summary of what I learnt.
First off, I managed to follow Ryan Lewis' PluralSight video and get it working as expected: deploying to AWS with load balancing and auto scaling. My project is very close to his "pizza-luvrs" project except that I'm using a React front end and MongoDB. However, for some reason, I can't make it work for my own project.
My goal was to have the load balancer work together with the auto scaling group using a pre-created AMI (with Node, pm2 and my project installed). Using the startup script below with pm2, I got the server running on port 3000.
#!/bin/bash
echo "starting xxx..."
# restart pm2 and thus node app on reboot
crontab -l | { cat; echo "@reboot sudo pm2 start /home/ec2-user/xxx/server.js -i 0"; } | crontab -
# start the server on port 3000
pm2 start /home/ec2-user/xxx/server.js -i 0
echo "xxx started."
However, the instances in the load balancer keep showing "OutOfService", although the instances in the auto scaling group always show InService. The strangest thing is that after I attach an Elastic IP (because my auto scaling instances are private) and SSH to the instance, without doing anything else, it eventually becomes InService (though not always). I can then disassociate the Elastic IP and it keeps its InService status. It sounded like the security group might be the cause of this problem, so I compared my setup with the "pizza-luvrs" project a thousand times and made sure they are exactly the same. Still, it works for his project but not for mine.
By the way, in the AWS instances view, if you select an instance and then the menu "Instance Settings" > "Get System Log", you can see how the instance was started. This is how I can tell whether my startup script in "user data" was executed.
What you put in user data is executed once, on first boot, and that's it. It's intended for setting up the instance the first time it's started. On subsequent startups it will not run. If you want to schedule a script to run on every boot, look up how to do that on whatever OS you're using.
Assuming you run some form of Linux that uses systemd, which is the most popular choice, have a look here
You can also execute user data on every boot, but it's really not worth the trouble.
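As a rough sketch of the systemd route (untested; the unit name, user and paths are placeholders loosely based on the script quoted in the question, and the npm path may differ on your AMI):

# Untested sketch: create a systemd unit that starts the Node app on every boot.
sudo tee /etc/systemd/system/myapp.service > /dev/null <<'EOF'
[Unit]
Description=Node app
After=network.target

[Service]
User=ec2-user
WorkingDirectory=/home/ec2-user/projectname
ExecStart=/usr/bin/npm start
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# Reload systemd, then enable the unit at boot and start it now.
sudo systemctl daemon-reload
sudo systemctl enable --now myapp.service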

ELB backend connection errors when deregister ec2 instances

I've written a custom release script to manage releases for an EC2 autoscaling application. The process works like so:
1. Create an AMI based on an application git tag.
2. Create a launch config.
3. Configure the ASG to use the new launch config.
4. Find the current desired capacity for the ASG.
5. Set the desired capacity to 2x the previous capacity (see the sketch after this list).
6. Wait for the new instances to become healthy by querying the ELB.
7. Set the desired capacity back to the previous value.
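A rough sketch of steps 4, 5 and 7 with the AWS CLI (the ASG name is a placeholder; this isn't the poster's actual script):

# Read the current desired capacity, double it, and later set it back.
asg_name="my-asg"
current=$(aws autoscaling describe-auto-scaling-groups \
  --auto-scaling-group-names "$asg_name" \
  --query 'AutoScalingGroups[0].DesiredCapacity' --output text)
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name "$asg_name" --desired-capacity $((current * 2))
# ... wait for the new instances to show InService on the ELB (step 6) ...
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name "$asg_name" --desired-capacity "$current"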
This all works fairly well, except whenever I run this, the monitoring for the ELB is showing a lot of backend connection errors.
I don't know why this would be occurring, as it should (based on my understanding) still service current connections if the "Connection draining" option is enabled for the ELB (which it is).
I thought perhaps the ASG was terminating the instances before the connections could finish, so I changed my script to first deregister the instances from the ELB, and then wait a while before changing the desired capacity at the ASG. This however didn't make any difference. As soon as the instances were deregistered from the ELB (even though they're still running and healthy) the backend connection errors occur.
It seems as though it's ignoring the connection draining option and simply dropping connections as soon as the instance has been deregistered.
This is the command I'm using to deregister the instances...
aws elb deregister-instances-from-load-balancer --load-balancer-name $elb_name --instances $old_instances
Is there some preferred method to gracefully remove the instances from the ELB before removing them from the ASG?
Further investigation suggests that the back-end connection errors are occurring because the new instances aren't yet ready to take the full load when the old instances are removed from the ELB. They're healthy, but seem to require a bit more warming.
I'm working on tweaking the health-check settings to give the instances a bit more time before they start trying to serve requests. I may also need to change the apache2 settings to get them ready quicker.
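For anyone after a graceful removal pattern, one sketch (reusing the variable names from the deregister command above; the polling logic is an assumption about how deregistration state is reported, not something verified in this thread) is to deregister and then wait until the ELB no longer reports the instances as InService, i.e. draining has finished, before letting the ASG terminate them.

# Deregister the old instances, then poll until each one stops showing
# InService (connection draining keeps it InService while it finishes).
aws elb deregister-instances-from-load-balancer \
  --load-balancer-name "$elb_name" --instances $old_instances
for i in $old_instances; do
  while aws elb describe-instance-health \
      --load-balancer-name "$elb_name" --instances "$i" \
      --query 'InstanceStates[0].State' --output text 2>/dev/null \
      | grep -q InService; do
    sleep 10   # still draining; keep waiting
  done
done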

ELB always reports instances as inservice

I am using an AWS ELB to report the status of my instances to an auto scaling group, so that a non-functional instance is terminated and replaced by a new one. The ELB is configured to ping TCP:3000 every 60 seconds, with a timeout of 10 seconds before a check counts as a health check failure. The unhealthy threshold is 5 consecutive failed checks.
However, the ELB always reports my instances as healthy and InService, even though I periodically come across an instance that is timing out. I have to terminate it manually and launch a new one, while the ELB keeps reporting it as InService the whole time.
Why does this happen?
After investigating a little bit I found the cause.
I was trying to assess the health of the app through an API call to a web app running on the instance and waiting for the response to time out to declare the instance faulty. A TCP check only verifies that port 3000 accepts connections, so I needed to use HTTP as the health check protocol on port 3000 with a custom path instead of TCP.
Note: the API needs to return a status code of 200 for the load balancer to consider the instance healthy. It now works perfectly.
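For reference, a health check along those lines can be set with the AWS CLI; the load balancer name and the /health path are placeholders, the interval, timeout and unhealthy threshold mirror the values above, and the healthy threshold of 2 is just an example since the post doesn't mention one.

# Placeholder LB name and path; an HTTP target requires a 200 response.
aws elb configure-health-check \
  --load-balancer-name my-load-balancer \
  --health-check Target=HTTP:3000/health,Interval=60,Timeout=10,UnhealthyThreshold=5,HealthyThreshold=2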