We've started using a GCP instance schedule for one of our VMs, which needs to be up for 3 hours every night. For some reason, about once a week the VM does not come up and services can't access it.
Logs Explorer shows no errors or warnings, but on the days when the VM is not working a few events are missing from the logs. These are the GCE Agent Started and OSConfig Agent Started events, which appear on days when everything is OK (09-11, 09-12, 09-14) but are absent on days when the instance is not up (09-13).
The VM is Windows Server 2012 R2.
There is no retry policy implemented in the GCP instance schedule feature.
We know there are other ways to schedule VMs but we'd prefer to use the instance schedule feature if possible and if it is stable.
Is there somewhere else we should look for understanding why the VM is not starting properly?
This is the image from logs:
Instance schedules do not provide capacity guarantees, so if the resources required for a scheduled VM instance are not available at the scheduled time, your VM instance might not start when scheduled. Although you can reserve VM instances before starting them to provide capacity guarantees, reservations cannot be automatically scheduled. (This assumes the behaviour shows up on different VM instances from week to week, not on one particular VM every week.)
If it is the same VM every time, high memory utilization can also make the VM unresponsive. A manual reboot would fix this, since it stops whatever is consuming the memory and restarts any processes or services that were killed when the VM ran out of memory (OOM).
Please consider monitoring the VM's memory usage by installing a monitoring agent, and increase the memory allocated to the VM based on the observed utilization.
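To find the failed nights without scanning Logs Explorer by hand, one option is to export the log entries and check every expected date for the two agent start events named in the question. This is only a minimal sketch: it assumes the entries have already been exported as (date, message) pairs, and the function name and input shape are hypothetical, not part of any Google Cloud API. The year in the sample data is arbitrary, because the question gives only month and day.

```python
from datetime import date, timedelta

# Event messages named in the question; a healthy start logs both.
EXPECTED_EVENTS = ("GCE Agent Started", "OSConfig Agent Started")


def days_missing_start_events(entries, first_day, last_day):
    """Return the dates in [first_day, last_day] that lack an expected event.

    entries: iterable of (date, message) pairs exported from the logs
    (hypothetical shape; adapt it to however you export your logs).
    """
    seen = {}
    for day, message in entries:
        for event in EXPECTED_EVENTS:
            if event in message:
                seen.setdefault(day, set()).add(event)

    missing = []
    day = first_day
    while day <= last_day:
        # A day counts as healthy only if every expected event was logged.
        if seen.get(day, set()) != set(EXPECTED_EVENTS):
            missing.append(day)
        day += timedelta(days=1)
    return missing


# Sample data mirroring the question: events on 09-11, 09-12, 09-14, none on 09-13.
entries = []
for d in (11, 12, 14):
    for event in EXPECTED_EVENTS:
        entries.append((date(2021, 9, d), event))

missing = days_missing_start_events(entries, date(2021, 9, 11), date(2021, 9, 14))
print(missing)
```

Running a check like this daily would at least tell you which nights failed, so that you can compare them with capacity errors or memory metrics for the same dates.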
Related
We have a VM server on GCP. Yesterday the server stopped responding and we could not even SSH into it, but everything was OK after restarting the server. Looking at the metrics, this is what I have noticed:
There is no Memory Utilization data for that period. Before this, the Memory Utilization was 90%.
Read throughput is quite high: 13 MiB/s.
What could have gone wrong? What else should I consider looking at?
Harith:
The application processes running in your VM consumed all of the memory assigned to the VM.
Analyze each application hosted on the VM and evaluate its MTRs (Minimal Technical Requirements) and the actual workload that each one represents, in order to estimate whether the assigned memory is enough to support that load.
Consult the log entries of those applications, if available, to see whether they reveal the consumption level around the time the VM became unresponsive.
Consider changing the machine type if you have to increase any resource capacity assigned to your VM.
If the resource consumption of the applications running on your VM is highly variable, you will need to consider implementing autoscaling groups of instances.
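To judge whether the assigned memory is enough, it helps to look for sustained pressure rather than a single spike. Below is a small sketch that scans a series of memory utilization samples (in percent) and reports whether several consecutive samples sit at or above a threshold. The 90% default comes from the figure in the question; the function name, the window of three samples, and the use of None for gaps in the data are assumptions made for illustration.

```python
def sustained_high_memory(samples, threshold=90.0, min_consecutive=3):
    """Return True if `min_consecutive` samples in a row are >= `threshold`.

    samples: memory utilization values in percent; None marks a gap in the
    data, such as the missing period described in the question.
    """
    run = 0
    for value in samples:
        if value is None:
            # A gap breaks the run; it says nothing about actual usage.
            run = 0
            continue
        run = run + 1 if value >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


print(sustained_high_memory([55.0, 91.0, 92.5, 95.0, None]))
print(sustained_high_memory([91.0, 50.0, 92.0, 40.0]))
```

If a check like this fires regularly before the VM hangs, that supports moving to a machine type with more memory.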
From the official documentation on Google Cloud Platform, GPU instances get maintenance once in a while:
GPU instances must terminate for host maintenance events, but can
automatically restart. These maintenance events typically occur once
per week, but can occur more frequently when necessary. You must
configure your workloads to handle these maintenance events cleanly.
Specifically, long-running workloads like machine learning and
high-performance computing (HPC) must handle the interruption of host
maintenance events. Learn how to handle host maintenance events on
instances with GPUs.
Also, according to the docs, you can get the maintenance notice from the metadata server one hour before the instance is shut down:
curl http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event -H "Metadata-Flavor: Google"
My question is: if I terminate the GPU instance as soon as I get the notification from the API, will the terminated instance still undergo maintenance as planned (after one hour)?
The maintenance only needs your app to restart in order to be applied. Most of the time it concerns a single underlying physical component that has to be updated, patched, or replaced. The principle is simply to restart your app. Why? Because when you restart your app, it restarts on another physical component. Once all instances have restarted elsewhere, Google can carry out the maintenance.
In your case, if the instance is terminated and you start it again, it will start on physical infrastructure that is not under maintenance, so there is no impact for you.
Note: no patches are applied at the software/OS level. Google is responsible for the underlying infrastructure (this maintenance), and you are responsible for the OS and its patching; see the IaaS column below.
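If you want to react to the notice from inside the VM, the curl call from the question can be sketched in Python roughly as follows. This uses only the standard library and the same URL and header as the curl command. As far as I know the endpoint returns NONE when nothing is scheduled and a value such as TERMINATE_ON_HOST_MAINTENANCE otherwise, but treat that as an assumption and check it against the metadata server documentation. The function names are made up for this example, and fetch_maintenance_event() only works when run on a Compute Engine instance.

```python
import urllib.request

# Same endpoint as the curl command in the question.
METADATA_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/maintenance-event"
)


def fetch_maintenance_event(timeout=5):
    """Query the metadata server; only reachable from inside a GCE VM."""
    request = urllib.request.Request(
        METADATA_URL, headers={"Metadata-Flavor": "Google"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def maintenance_pending(event_value):
    """True when the value signals an upcoming maintenance event.

    Assumes "NONE" means nothing is scheduled and anything else is a notice.
    """
    return event_value.strip().upper() != "NONE"


# On the VM you would poll periodically, for example:
#     if maintenance_pending(fetch_maintenance_event()):
#         ...checkpoint the workload, then stop the instance...
print(maintenance_pending("TERMINATE_ON_HOST_MAINTENANCE"))
```

Keeping the parsing separate from the network call makes it easy to test the decision logic off the VM.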
I had a VM instance running on Google Cloud, and it suggested: "you should resize instance to 2CPU and 16GB RAM from 4CPU and 16GB RAM".
I pressed Apply to set the new config. The instance stopped and has been stuck in the resize process for an hour; the gcloud instance list shows it neither as resized nor as starting up.
Even trying to take a snapshot of that VM's disk shows an error that "it's being used in some operations".
I tried to force a stop via gcloud, but no luck. The notification pop-up only shows that the VM is resizing.
Please help me here.
The main reason for this issue is GCP resource availability, which depends on user requests and is therefore dynamic. As a result, issues like this can happen when you use cloud resources on demand without a reservation.
Let's have a look at the cause of this issue:
when you stop an instance it releases some resources like vCPU and memory;
when you start an instance it requests resources like vCPU and memory back;
when you resize your VM it's the same.
If there are not enough resources available in the zone, you'll get an error message:
The zone 'projects/xyz-project-272905/zones/asia-south1-a' does not have enough resources available to fulfill the request. Try a different zone, or try again later..
You can find more details in the documentation:
If you receive a resource error (such as ZONE_RESOURCE_POOL_EXHAUSTED
or ZONE_RESOURCE_POOL_EXHAUSTED_WITH_DETAILS) when requesting new
resources, it means that the zone cannot currently accommodate your
request. This error is due to Compute Engine resource obtainability,
and is not due to your Compute Engine quota.
There are a few ways to solve your issue:
Move your instance to another zone by following the instructions.
Wait for a while and try to resize your VM instance again.
Reserve resources for your VM by following the documentation to avoid such issues in the future (extra payment will be required):
Create reservations for Virtual Machine (VM) instances in a specific
zone, using custom or predefined machine types, with or without
additional GPUs or local SSDs, to ensure resources are available for
your workloads when you need them. After you create a reservation, you
begin paying for the reserved resources immediately, and they remain
available for your project to use indefinitely, until the reservation
is deleted.
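The "wait for a while and try again" option can be automated with a simple retry loop and exponential backoff. The sketch below is deliberately generic: start_instance is a hypothetical callable that you would write around whatever client or CLI you use, and ZoneResourcesExhausted is a made-up exception standing in for a ZONE_RESOURCE_POOL_EXHAUSTED error. Neither is a real Google Cloud API name.

```python
import time


class ZoneResourcesExhausted(Exception):
    """Hypothetical stand-in for a ZONE_RESOURCE_POOL_EXHAUSTED error."""


def start_with_retry(start_instance, attempts=5, base_delay=60.0, sleep=time.sleep):
    """Call start_instance() until it succeeds or the attempts run out.

    start_instance: hypothetical callable that starts (or resizes) the VM and
    raises ZoneResourcesExhausted when the zone has no capacity.
    Returns the number of attempts that were needed.
    """
    for attempt in range(1, attempts + 1):
        try:
            start_instance()
            return attempt
        except ZoneResourcesExhausted:
            if attempt == attempts:
                raise  # Give up and surface the capacity error.
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a fake starter that fails twice; delays are recorded, not slept.
calls = {"count": 0}
delays = []


def fake_start():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ZoneResourcesExhausted("no capacity in zone")


used = start_with_retry(fake_start, sleep=delays.append)
print(used, delays)
```

A retry loop does not guarantee capacity; only a reservation or a move to another zone addresses the root cause.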
We have an EC2 instance which becomes unreachable at random. This only started recently, and seems to happen only outside of business hours.
When it happens, the instance's websites, WHM, and SSH are all unreachable, and it does not even respond to a ping from a terminal. However, the instance is running and health checks are fine in the AWS console.
We used to have this with another instance but that just randomly stopped doing it at some point.
I have checked the CPU usage: over the last 2 weeks it has hit 100% 4 times, but those times do not match when the instance goes down, and I'm not sure they're even related.
The instance has WHM/cPanel installed and has reached neither its disk usage limit nor its bandwidth limit. We have cPHulk Brute Force Protection installed and running, so surely it can't be a brute-force attack?
It is resolved by stopping and then starting the instance, but we have clients in different time zones viewing links, so the server going down outside of business hours is a problem.
I recommend installing the CloudWatch agent on the EC2 instance so that you can collect metrics and analyze them further.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Install-CloudWatch-Agent.html
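Memory and disk usage are not among the metrics EC2 reports by default, so those are the ones worth adding through the agent. As a rough sketch, the snippet below generates a minimal agent configuration that collects both. The field names (mem_used_percent, used_percent, metrics_collection_interval) are what I believe the agent expects, but please verify them against the agent's configuration reference before deploying; build_agent_config is just a helper name chosen for this example.

```python
import json


def build_agent_config(interval_seconds=60):
    """Build a minimal CloudWatch agent config for memory and disk usage.

    Field names are believed to match the agent's schema; verify them
    against the official configuration reference before use.
    """
    return {
        "metrics": {
            "metrics_collected": {
                "mem": {
                    "measurement": ["mem_used_percent"],
                    "metrics_collection_interval": interval_seconds,
                },
                "disk": {
                    "measurement": ["used_percent"],
                    "resources": ["*"],
                    "metrics_collection_interval": interval_seconds,
                },
            }
        }
    }


config_text = json.dumps(build_agent_config(), indent=2)
print(config_text)
```

With memory metrics in place, you can check whether usage climbs before each outage, which CPU graphs alone will not show.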
I've read https://cloud.google.com/compute/docs/instances/stopping-or-deleting-an-instance. It explains that:
You can stop an instance temporarily so you can come back to it at a
later time. A stopped instance does not incur charges, but all of the
resources that are attached to the instance will still be charged.
Alternatively, if you are done using an instance, delete the instance
and its resources to stop incurring charges.
However, I don't know how things are allocated in Google Cloud, so I'm not sure what this means.
Say I stop my instance. I guess I am still using their storage, but I'm not using any CPU, right? Does that mean that I will only pay for storage but not for the CPU and GPU hours?
Also, how does Google stop my instance? I know that if you suspend an instance in something like VirtualBox, you can start it again and resume from where you left off. Is that what Google does? Can I stop the instance halfway through running something and have it continue where it left off when I start the instance again?
When you stop your instance, you do not pay for CPU or GPU while it is stopped, because the instance is not using them. You are still charged for attached resources, as stated in the link you posted:
Your instances are not charged for per-second usage charges in TERMINATED state but any resources attached to the virtual machine will be charged until they are deleted, such as static IPs and persistent disks.
Google stops your instance by shutting it down, so you will lose any data that is not already on a persistent disk.
From their docs:
When you shut down or delete an instance, Compute Engine sends the ACPI Power Off signal to the instance and waits a short period of time for your instance to shut down cleanly. If your instance is still running after this grace period, Compute Engine forcefully terminates it even if your shutdown script is still running.
There is a gcloud command, in alpha, that can suspend your VM: gcloud alpha compute instances suspend. You can read more in the docs here. It only works on instances that do not use GPUs or CSEK and are not preemptible VMs.
Compute Engine documentation says:
A stopped instance does not incur charges, but all of the resources
that are attached to the instance continue to incur charges. For
example, you are charged for persistent disks and external IP
addresses according to the price sheet, even if an instance is
stopped. To stop being charged for attached resources, you can
reconfigure a stopped instance to not use those resources, and then
delete the resources.
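To make the billing rule concrete, here is a tiny worked sketch. The rates are made-up placeholder units, not real Google Cloud prices, and the function and resource names are hypothetical; the only thing it illustrates is that the compute part drops to zero when the instance is stopped while attached resources keep costing money.

```python
def hourly_charge(instance_running, compute_rate, attached_rates):
    """Charge for one hour: compute only while running, attachments always.

    compute_rate and attached_rates use placeholder units, not real prices.
    """
    compute = compute_rate if instance_running else 0.0
    return compute + sum(attached_rates.values())


# Placeholder rates in arbitrary units per hour (NOT actual pricing).
attached = {"persistent_disk": 0.25, "static_ip": 0.125}

running_cost = hourly_charge(True, compute_rate=1.0, attached_rates=attached)
stopped_cost = hourly_charge(False, compute_rate=1.0, attached_rates=attached)
print(running_cost, stopped_cost)
```

Deleting an attached resource corresponds to removing its entry from the dictionary, which is the only way the stopped cost reaches zero.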