Preemptible VMs in managed instance group go into TERMINATED state - google-cloud-platform

I have a Managed Instance Group made up of preemptible VMs. They are ephemeral and can be preempted at any time (our group is large enough to sustain losing several VMs at once). For the most part the MIG brings the VM count back up to the desired level after a preemption, but occasionally a node goes into the TERMINATED state, the MIG still counts it as available and does nothing to correct the issue, and so I am down one or more VMs. My understanding of the TERMINATED state is that "TERMINATED. A user shut down the instance, or the instance encountered a failure. You can choose to restart the instance or delete it". Given that we didn't shut the instance down, it must have encountered some failure, yet the logs don't indicate anything other than that the node was preempted. How can I configure my instance group to delete/recreate VMs that end up in this state?

Reading your question, I understand that you want to know why your VMs are terminated all the time, right?
Since you mentioned that you are using a Managed Instance Group with preemptible VMs, the VMs are always terminated within 24 hours (or less), according to this document.
Other than that, if you want to be sure about what happened on your instance in the last few hours, I recommend that you SSH into the instance and use "journalctl", for example:
journalctl -b --since "2021-03-04 00:00:00" | grep 'terminated'
This command will look for all the "terminated" statements from the given timestamp to the moment you run the command.
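If you would rather check from outside the VM, preemption events also appear in the project's operations list; a small sketch, assuming the filter value documented by Google for preemption events and a placeholder project ID:
# List recent preemption operations for the project (my-project is a placeholder)
gcloud compute operations list \
    --project my-project \
    --filter="operationType=compute.instances.preempted" \
    --format="table(name, targetLink, insertTime)"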
If you don't mind your VMs being terminated every 24 hours, I don't see a problem with using preemptible VMs. But if this is causing problems in your operation, I would suggest turning off the preemptible feature and letting the load balancer act according to your needs.
Jose.

Related

GCP VM can't start or move TERMINATED instance

I'm running into a problem starting my Google Cloud VM instance. I wanted to restart the instance so I hit the stop button but this was just the beginning of a big problem.
Starting it failed with an error that the zone did not have enough capacity. Message:
The zone 'XXX' does not have enough resources available to fulfill the request. Try a different zone, or try again later.
I tried and retried till I decided to move it to another zone and ran:
gcloud compute instances move VM_NAME --destination-zone NEW_ZONE
I then get the error:
Instance cannot be moved while in state: TERMINATED
What am I supposed to do???
I'm assuming that this is a basic enough issue that there's a common way to solve for this.
Thanks
Edit: I have since managed to start the instance but would like to know what to do next time
The correct solution depends on your criteria.
I assume you're using Preemptible instances for their cost economies but -- as you've seen, there's a price -- sometimes non-preemptible resources are given the priority and sometimes (more frequently than for regular cores) there are insufficient preemptible cores available.
While it's reasonable to want to, you cannot move stopped instances between zones in a region.
I think there are a few options:
Don't use Preemptible. You'll pay more but you'll get more flexibility.
Use Managed Instance Groups (MIGs) to maintain ~1 instance (in the region|zone)
(for completeness) consider using containers and perhaps Cloud Run or Kubernetes
You describe wanting to restart your instance. Perhaps this was because you made some changes to it. If this is the case, you may wish to consider treating your instances as being more disposable.
When you wish to make changes to the workload (a gcloud sketch of this flow follows at the end of this answer):
IMPORTANT ensure you're preserving any important state outside of the instance
create a new instance (at this time, you will be able to find a zone with capacity for it)
once the new instance is running correctly, delete the prior version
NB Both options 2 (MIGs) and 3 (Cloud Run|Kubernetes) above implement this practice.
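A minimal gcloud sketch of that create-then-delete flow, assuming placeholder instance names, zones and machine type:
# Create the replacement instance in a zone that currently has capacity
gcloud compute instances create new-instance-1 --zone us-central1-b --machine-type e2-medium
# Once the new instance is running correctly, remove the old one
gcloud compute instances delete old-instance-1 --zone us-central1-a --quiet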

Amazon EC2 instance passed 1/2 checks

Newbie to Amazon Web Services here. I launched an instance from a Public AMI and found that I could not ssh into the instance - I received the error "Connection timed out." I checked the security groups to verify that Port 22 was associated with 0.0.0.0/0. Additionally, I checked the route tables to verify that 0.0.0.0/0 is associated with target gateway attached to the VPC.
I find that only 1/2 status checks have passed - the instance status check failed. I have tried stopping and starting the instance as well as terminating it and launching a new instance, both to no avail. The error that I see in the system log is:
Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(8,1).
From this previous question (Ec2 1/2 checks passed), it appears that this could be a virtualization issue, but I'm not sure if that was due to something I did on my end when launching the instance or something done by the creators of the AMI?
Any help would be appreciated!
Can you share any more details about how you deployed the instance? Did you use the AWS Management Console, or one of the command line tools or SDKs to deploy it? Which public AMI did you use? Was it one of the ones provided by Amazon?
Depending on your needs, I would make sure that you use one of the AMIs provided by Amazon, such as Ubuntu, Amazon Linux, CentOS, etc. Here are the links to the docs on AMIs, but you can learn quite a bit by just searching for images. Since you mentioned virtualization types though, I'd suggest reading up briefly on the HVM vs. Paravirtual virtualization types on AWS. Each of the instance types / families uses a certain virtualization type, which is indicated in the chart on this page.
Instance Status Checks
This documentation page covers the instance status checks, which you'll probably want to familiarize yourself with. It's entirely possible that shutting down (not a restart, but a full shutdown) and then starting the instance back up might resolve the instance status check failure.
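For reference, a stop/start (as opposed to a reboot) can also be done from the AWS CLI; a small sketch with a placeholder instance ID:
# Fully stop the instance, wait until it is stopped, then start it again
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 start-instances --instance-ids i-0123456789abcdef0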
Spot Instances - cost savings!
By the way, I'll just mention this since you indicated that you're new to AWS ... if you're just playing around right now, you can save a ton of cost by deploying EC2 Spot Instances, instead of paying the normal, on-demand rates. Depending on current rates, you can save more than 50%, and per-second billing still applies. Although there's the possibility that your EC2 instance could get "interrupted" based on market demand, you can configure your Spot Instance to just "Hibernate" or "Stop" instead of terminating and relaunching. That way, your work and instance state are saved for when it relaunches.
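If you want to try that from the CLI, here's a rough sketch of launching a Spot Instance that stops on interruption; the AMI ID is a placeholder and you should double-check the current shorthand syntax against the AWS docs:
# Request a persistent Spot Instance that stops (rather than terminates) on interruption
aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type t3.micro \
    --instance-market-options 'MarketType=spot,SpotOptions={SpotInstanceType=persistent,InstanceInterruptionBehavior=stop}'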
Hope this helps!
1) Use well-known images or contact the image developer. Perhaps it requires more than one drive or tricky partitioning.
2) Make sure you selected the proper HVM/PV image for the instance type.
3) (After the checks pass) make sure the instance has a public IP.

Why does Google Cloud Compute API consider a stopped instance a "TERMINATED" status?

Trying to understand what GCP's choice in terminology means - by comparison, in AWS, a terminated instance is gone forever, but a stopped instance can be restarted.
If you stop an instance in GCP console, the status returned by the GCP Compute API for that instance is TERMINATED, however, you can start the instance.
Is this a bug in the way GCP reports instance statuses, or is a TERMINATED status an indication that some resources have actually been terminated/deleted?
The TERMINATED status is probably named the way it was when GCE became generally available.
TERMINATED status means that the VM instance is no longer running (no more writing to disk, no pings, etc.; it's powered off). The VM instance can be started again, which means a new VM process is created for it. This VM may be configured slightly differently (for example, a different IP address; use static IPs if you want to keep the same one). You are not billed for stopped instances.
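You can see this from the CLI as well; a quick sketch with placeholder names, showing that a stopped instance reports TERMINATED and can still be started:
gcloud compute instances stop my-instance --zone us-central1-a
# The status of the stopped instance is reported as TERMINATED
gcloud compute instances describe my-instance --zone us-central1-a --format="value(status)"
gcloud compute instances start my-instance --zone us-central1-a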
VMs can be terminated for 3 reasons:
You stopped it.
There was some unexpected failure that happened before the VM could be live migrated.
If the VM is preemptible, it hit the 24-hour lifetime limit or was preempted.
Docs:
https://cloud.google.com/compute/docs/instances/stopping-or-deleting-an-instance#stop_an_instance
https://cloud.google.com/compute/docs/instances/checking-instance-status#instance_statuses

Ec2 1/2 checks passed

Since today I can't access my instance. I tried to stop and restart it several times but the status is always: "1/2 checks passed".
I tried to create a snapshot, then detach and reattach a new volume, but the result is the same.
I also tried to create another instance and attach the volume, and it's not starting either.
Any help?
The status checks automatically performed on Amazon EC2 instances are:
System Status Checks: These check the underlying systems used by the Amazon EC2 instance
Instance Status Checks: These check the configuration of the specific instance
See documentation: Status Checks for Your Instances
Often, an instance is available and ready to be used before these checks are complete -- this is especially the case for Linux instances because they boot very quickly.
If you receive a 1/2 checks passed message, either wait a little longer or Stop and Start the instance. Performing a Stop/Start will launch the instance on a different host, which will probably fix whatever problem was being experienced.
If the 1/2 checks passed message continues to appear after a Stop/Start, it is probably a misconfiguration of the AMI. I have seen this when the wrong virtualization type was selected for an AMI that was created from a Snapshot.
You might be able to get a hint about the problem by using the Get System Log command in the Actions menu, which shows the log while the instance is booting.
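The CLI equivalent, if you prefer it (the instance ID is a placeholder):
# Fetch the boot log for the instance, same as the console's Get System Log action
aws ec2 get-console-output --instance-id i-0123456789abcdef0 --output text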
Worst case, launch a new instance from a known-good AMI, attach the non-booting volume as an additional disk and copy files to the new disk. You will still have access to your files even if it will not boot.
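A rough sketch of that rescue procedure with the CLI, assuming placeholder volume/instance IDs and that the volume and the rescue instance are in the same Availability Zone:
# Detach the non-booting root volume from the broken instance
aws ec2 detach-volume --volume-id vol-0123456789abcdef0
# Attach it to a healthy instance as a secondary disk, then mount it there and copy your files off
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0fedcba9876543210 --device /dev/sdf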
You can check the description of the checks here, and understand which one is not working...

AWS AutoScaling 'oldestinstance' Termination Policy does not always terminate oldest instances

Scenario
I am creating a script that will launch new instances into an AutoScaling Group and then remove the old instances. The purpose is to introduce newly created (or updated) AMIs to the AutoScaling Group. This is accomplished by increasing the Desired capacity by double the current number of instances. Then, after the new instances are Running, decreasing the Desired capacity by the same number.
Problem
When I run the script, I watch the group capacity increase by double, the new instances come online, they reach the Running state, and then the group capacity is decreased. Works like a charm. The problem is that SOMETIMES the instances that are terminated by the decrease are actually the new ones instead of the older ones.
Question
How can I ensure that the AutoScaling Group will always terminate the Oldest Instance?
Settings
The AutoScaling Group has the following Termination Policies: OldestInstance, OldestLaunchConfiguration. The Default policy has been removed.
The Default Cooldown is set to 0 seconds.
The Group only has one Availability Zone.
Troubleshooting
I played around with the Cooldown setting. Ended up just putting it on 0.
I waited different lengths of time to see if the existing servers needed to be running for a certain amount of time before they would be terminated. It seems that if they are less than 5 minutes old, they are less likely to be terminated, but not always. I had servers that were 20 minutes old that were not terminated instead of the new ones. Perhaps newly launched instances have some termination protection grace period?
Concession
I know that in most cases, the servers I will be replacing will have been running for a long time. In production, this might not be an issue. Still, it is possible that during the normal course of AutoScaling, an older server will be left running instead of a newer one. This is not an acceptable way to operate.
I could force specific instances to terminate, but that would defeat the point of the OldestInstance Termination Policy.
Update: 12 Feb 2014
I have continued to see this in production. Instances with older launch configs that have been running for weeks will be left running while newer instances are terminated. At this point I am considering this to be a bug. A thread at Amazon was opened for this topic a couple of years ago, apparently without resolution.
Update: 21 Feb 2014
I have been working with AWS support staff and at this point they have preliminarily confirmed it could be a bug. They are researching the problem.
It doesn't look like you can, precisely, because Auto Scaling is trying to do one other thing for you in addition to keeping the correct number of instances running: keep your instance counts balanced across Availability Zones... and it prioritizes this consideration higher than your termination policy.
Before Auto Scaling selects an instance to terminate, it first identifies the Availability Zone that has more instances than the other Availability Zones used by the group. If all Availability Zones have the same number of instances, it identifies a random Availability Zone. Within the identified Availability Zone, Auto Scaling uses the termination policy to select the instance for termination.
— http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/us-termination-policy.html
If you're out of balance, then staying in balance is arguably the most sensible strategy, especially if you are using ELB. The documentation is a little ambiguous, but ELB will advertise one public IP in the DNS for each availability zone where it is configured; these three IP addresses will achieve the first tier of load balancing by virtue of round-robin DNS. If all of the availability zones where the ELB is enabled have healthy instances, then there appears to be a 1:1 correlation between which external IP the traffic hits and which availability zone's servers that traffic will be offered to by ELB -- at least that is what my server logs show. It appears that ELB doesn't route traffic across availability zones to alternate servers unless all of the servers in a given zone are detected as unhealthy, and that may be one of the justifications of why they've implemented autoscaling this way.
Although this algorithm might not always kill the oldest instance first on a region-wide basis, if it does operate as documented, it would kill off the oldest one in the selected availability zone, and at some point it should end up cycling through all of them over the course of several shifts in load... so it would not leave the oldest running indefinitely, either. The larger the number of instances in the group, the less significant this effect should be.
There are a couple of other ways to do it:
Increase desired to 2x
Wait for action to increase capacity
When the new instances are running, suspend all AS activity (as-suspend-processes MyAutoScalingGroup with the legacy CLI tools; an equivalent aws CLI sketch follows these steps)
Reset desired
Terminate old instances
Resume AS activity.
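With the current aws CLI, that sequence looks roughly like this (the group name, capacity and instance IDs are placeholders):
# Pause scaling activity, drop the desired capacity back down, remove the old instances, then resume
aws autoscaling suspend-processes --auto-scaling-group-name MyAutoScalingGroup
aws autoscaling set-desired-capacity --auto-scaling-group-name MyAutoScalingGroup --desired-capacity 4
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0 i-0fedcba9876543210
aws autoscaling resume-processes --auto-scaling-group-name MyAutoScalingGroup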
Or:
Bring up a brand new ASG with the new launch config.
Suspend AS activity until step 1 is finished.
If everything is ok, delete the old ASG.
Resume AS activity
For ultimate rollback deployment:
Create new ELB (you might have to ask Amazon to provision more ELB capacity if you have a lot of traffic; this is kinda lame and makes it not automation friendly)
Create new ASG with new LC
Switch DNS to new ELB
Delete old ELB/ASG/LC if everything's fine, if not just change DNS back
Or with the new ASG API that lets you attach/detach instances from ASG:
Somehow bring up your new instances (could just be run-instances or create a temp asg)
Suspend AS activity, then use http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/attach-instance-asg.html to attach them to your old ASG (see the CLI sketch after these steps),
Use http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/detach-instance-asg.html to detach, or simply terminate, your old instances
Resume AS activity
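A sketch of that attach/detach variant with the aws CLI (the instance IDs and group name are placeholders):
aws autoscaling suspend-processes --auto-scaling-group-name MyAutoScalingGroup
# Attach the replacement instances to the existing group (this raises the desired capacity)
aws autoscaling attach-instances --instance-ids i-0aaaaaaaaaaaaaaa1 i-0aaaaaaaaaaaaaaa2 --auto-scaling-group-name MyAutoScalingGroup
# Detach the old instances so the group stops managing them, then terminate them
aws autoscaling detach-instances --instance-ids i-0bbbbbbbbbbbbbbb1 --auto-scaling-group-name MyAutoScalingGroup --should-decrement-desired-capacity
aws ec2 terminate-instances --instance-ids i-0bbbbbbbbbbbbbbb1
aws autoscaling resume-processes --auto-scaling-group-name MyAutoScalingGroup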
The reason you might want to use your old ASG is because it can be a pita to set all the policies again (even when automated) and it feels a bit safer to change as little as possible.
A.
My use case is that we needed to scale down and be able to choose which machines go down. Unfortunately the termination policy "OldestInstance" was not working for us either. I was able to use a variant of the attach/detach method that ambakshi shared to remove the oldest (or any instance I choose) and at the same time lower the desired instances value of the autoscaling group.
Step 1 – Change the autoscaling group Min value to the number you want to scale down to.
Step 2 – Suspend the ASG
Step 3 – Detach the instances you want to terminate; you can do multiple instances in one command. Make sure to use the should-decrement-desired-capacity flag (a CLI sketch of these steps follows the update below)
Step 4 – Resume the ASG
Step 5 – Terminate your instances using the console or the CLI
UPDATE
There is no need to suspend the Auto Scaling Group; just doing steps 1, 3 and 5 worked for me. Just be aware of any availability zone balancing that may happen.
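For reference, a minimal aws CLI sketch of steps 1, 3 and 5 (the group name, sizes and instance ID are placeholders):
# Step 1: lower the group's minimum size to the target
aws autoscaling update-auto-scaling-group --auto-scaling-group-name MyAutoScalingGroup --min-size 2
# Step 3: detach the chosen instance and let the desired capacity drop with it
aws autoscaling detach-instances --instance-ids i-0123456789abcdef0 --auto-scaling-group-name MyAutoScalingGroup --should-decrement-desired-capacity
# Step 5: terminate the detached instance
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0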