GCP VM stopped responding - google-cloud-platform

We have a VM server on GCP. Yesterday, the server stopped responding, we could not even SSH into the server, but everything was ok after restarting the server. I am having a look at the metrics and this is what I have noticed:
There is no Memory Utilization data for that period. Before this, the Memory Utilization was 90%.
Read Through Put is quite high; 13 MiB/s
What could have gone wrong? What else should I consider looking at?

Harith:
The applications processes running in your VM consumed the totality of the memory assigned to the VM.
Analyze each application hosted on the VM and evaluate its MTRs (Minimal Technical Requirements) and the actual work load that each one represent, this in order to estimate if the memory amount assigned is enough to support that load.
Consult log entries if available on those applications to see if they can reveal the consumption level just after the unresponsive condition.
Consider changing the machine type if you have to increase any resource capcity assigned to your vm.
If the resource consumption of your applications running on your VM will be very variable, you will need consider the implementation of autoscaling groups of instances.

Related

GCP VM instance schedule randomly not starting VM

We've started using GCP Instance schedule for one of our VMs which needs to be up for 3 hours every night. For some reason, about once per week the VM is not up - services can't access it.
Checking from Logs Explorer, there are no errors or warnings, but on those days when it is not working, there are a few events which are not published/logged. These are the GCE Agent Started and OSConfig Agent Started events which happen on days where everything is OK (09-11, 09-12, 09-14) but are missing on days when the instance is not up (09-13).
The VM is Windows Server 2012 R2.
There is no retry policy implemented in the GCP instance schedule feature.
We know there are other ways to schedule VMs but we'd prefer to use the instance schedule feature if possible and if it is stable.
Is there somewhere else we should look for understanding why the VM is not starting properly?
This is the image from logs:
Instance schedules do not provide capacity guarantees, so if the resources required for a scheduled VM instance are not available at the scheduled time, your VM instance might not start when scheduled. Although you can reserve VM instances before starting them to provide capacity guarantees, reservations cannot be automatically scheduled.(Assuming that randomly VM instances are showing up this behaviour every week, not a particular VM every week.)
If it's with the same VM everytime then high memory utilization can also cause VM not being responsive. Manual reboot would fix this since it would close whatever is consuming the memory and re-open processes or services that may have been killed due to being OOM.
Please consider monitoring the VM memory usage by installing a monitoring agent, and increase the memory request based on the utilization.

Google Cloud Instance downtime issue

I am using Google Cloud Instance for 1 of my website.
But daily at same time my server went down. you can say that only 1 - 10 minutes difference daily maximum
When I checks in monitoring it shows me that Disk Throughput (Write) is very high.
I changed disk as well as using N2 Type machine
Waiting for Suggestions.
Thanks
For this scenarios usually an application running in your VM is consuming more resources than the VM has.
You could also review if there is any peak at the same time for CPU utlization and or if there is any peak network traffic this could point to to http requests overlading your vm.
As shot term solution you could add more persistant disk and change the machien type to increse the I/O disk performance , for reference you can review the article Optimizing persistent disk performance

Why EC2 instance is not accessible to others

I deployed the Machine Learning classification model in AWS EC2 (UBUNTU)instance successfully. I am able to access the instance "http://ec2-18-191-31-0.us-east-2.compute.amazonaws.com" and predictions are working fine only for few minutes. After that I or my colleagues are not able to access this. Getting an error "cannot connected to the server".
Security group that I crated as attached.
t2.micro instances are not suitable for any long running calculations. They are burstable. This means that their performance can be sustained only for short periods of time, e.g., sudden, short lived spikes in CPU usage. On top of that they have only 1 GB of RAM which limits its usefulness in machine learning.
For calculations, you could consider Compute optimized or Memory optimized instances. Obviously, these instance types are not free, but they are suited for calculations.
You can change instance type if you want and test with other, more power types. What you are describing indicates that your t2.micro exhausts all its RAM and/or CPU burst credits after few minutes and it freezes.
You can use CloudWatch Metrics for EC2 to monitor your instances and observer its CPU utilization and other metrics which can help you determine what exactly is causing the backlog. You can also monitor RAM and disc usage but this requires CloudWatch Agent setup on the instance.

Azure VM Inbound Throttling to VMs?

We have 2 Elastic VMs (Linux) (Currently DS2V2) behind an Azure Load Balancer. We are doing HTTP Posts from our local lan into the Load Balancer, but we seem to be getting throttled. We have tried: Changing the size of the VMs, no difference; adding additional premium SSDs, again no difference; running multiple threads on our end, again no differenece.
What we did do though, was to having the Elastic Engine suck in all of the log files from the Linux boxes and the index rate jump pretty high while it was ingesting them. So we are assuming that it's not really the Linux Elastic boxes that are throttling us.
We do have Kibana installed on the boxes, and as a base line, we're just using the "Cluster Indexing Rate" for both our local posts to the box, and the local ingestion of the log files.
We do understand that yes, there is going to be some latency and overhead since we are now involving the internet, but not the rates we are currently getting. (We have a 1G pipe to the internet, it's nowhere near capacity, so we can rule out at least getting out of our company).
The question is, where else can we look to determine where we might be getting throttled?
For the performance "MUCH slower", it is a bit subjective question and hard to identify. I just provide some information that may impact it.
Azure Compute requests may be throttled at a subscription and on a per-region basis. If you have an API throttling error, you could refer to this document to troubleshoot throttling issues, and best practices to avoid being throttled.
Some factors CPU and storage limits that differ on Azure VM sizes may impact the Azure VM to process incoming data. You may change the size to a higher CPU and premium SSD disk. You could also change Azure resources to another region which is close to your location. You could refer to this article.

Optimizing azure virtual machine size

I have a web app hosted on Ubuntu-based Azure classic virtual machine (size DS14). The CPU usage, load, memory, disk I/O and network I/O changes over the previous 7 days are as follows:
Clearly, there's opportunity to save money here by scaling my infrastructure dynamically up and down alongwith changes in load, instead of having a DS14 instance running all the time.
Can someone please outline the steps I'll need to do to enable this? My VM is not part of any availability set as of now.
You could add a classic VM to an availability set. Please refer to this link:Add an existing virtual machine to an availability set.
Notes: ARM VM does not support add an existing VMs to an availability set.
If you want to VM supports dynamically auto scale, you need at least two VMs in a same availability set. You could refer to this link:Automatic scale - CPU.
According to your description, you want to your VM auto up and down, I think it is not possible. When VM is up and down, the VM needs restart, your service will be interrupted for a minutes. As a production environment, this is not acceptable.