get alert when instance is not active in GCE - google-cloud-platform

For Compute Engine (GCE) instances, I use Stackdriver Monitoring for monitoring and alerting.
Most of the general metrics like CPU, disk I/O, and memory are available, and I can set alerts on those metrics or configure dead-or-alive checks by process name.
However, I cannot find any metric related to the status of the GCE instance itself.
My use case is simple: I'd like to know whether the instance is down or not.
Any suggestions appreciated.
Thanks.

I think the instance status is not a monitoring metric; there's just instance/uptime available
(and I have no clue what it returns when the instance is terminated; possibly worth a try).
But you can check servers with Uptime Checks and then have an incident reported.
And you can get the instance status with gcloud compute instances describe instance01.
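If you want to poll that status programmatically, a minimal sketch with the google-cloud-compute Python client could look like the following (the project, zone, and instance names are placeholders):

from google.cloud import compute_v1

def instance_status(project: str, zone: str, name: str) -> str:
    # The API reports e.g. RUNNING, STOPPING, TERMINATED.
    client = compute_v1.InstancesClient()
    instance = client.get(project=project, zone=zone, instance=name)
    return instance.status

status = instance_status("my-project", "us-central1-a", "instance01")
if status != "RUNNING":
    print(f"ALERT: instance01 is {status}")

Run on a schedule (cron, Cloud Scheduler plus a function, etc.), this gives a crude dead-or-alive check that is independent of the monitoring metrics.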

Related

How to know the history of what happened in GCP

I had a VM in my account, and out of nowhere, the VM just disappeared. Is there any way to review what was done and why?
It seems that if you are using the free trial, you need to explicitly enable billing during the trial; otherwise your instances will be shut down when the trial runs out. It is not possible to retrieve an instance once it has been deleted. If it has only been stopped, it can be brought back by simply starting it again.
However, during creation of the instance you can configure deletion rules to keep the boot disk when the instance is deleted. This can be configured in the submenu "Management, security, disks, networking, sole tenancy", in the Disks section.
Refer to this SO answer for more information.
You can review what has been done using Audit Logs on GCP. Audit logs help you answer "who did what, where, and when?" within your Google Cloud resources with the same level of transparency as in on-premises environments. This could help you determine what happened to your VM.
To view Audit Logs for Compute Engine, please refer to this doc. To read more about the Compute Engine Audit Logs, you can review this doc.
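As an illustration of querying those logs programmatically, here is a rough sketch with the google-cloud-logging Python client that pulls recent Compute Engine Admin Activity entries for delete calls; the project ID and filter are placeholders to adapt, and the payload layout is hedged in the comments:

from google.cloud import logging

client = logging.Client(project="my-project")
# Admin Activity audit log for Compute Engine, narrowed to delete calls.
log_filter = (
    'logName="projects/my-project/logs/cloudaudit.googleapis.com%2Factivity"'
    ' AND resource.type="gce_instance"'
    ' AND protoPayload.methodName:"delete"'
)
for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    # For audit entries the payload is the protoPayload as a dict-like object.
    payload = entry.payload
    print(entry.timestamp,
          payload.get("methodName"),
          payload.get("authenticationInfo", {}).get("principalEmail"))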

GCP Uptime Metric is giving unreliable alerts

I'm trying to get an alert when a GCE VM is down by creating an alerting policy.
Metric: compute.googleapis.com/instance/uptime
Resource: VM instance
I configured the condition to trigger an alert when this metric is absent for 3 minutes.
To simulate this behavior, I stopped the VM, but it does not trigger an alert; meanwhile, no data is visible in the alerting policy's graph.
I have attached the trigger configuration.
None of the metrics (compute.googleapis.com/instance/uptime, the monitoring agent's uptime metric, or CPU utilization) gives reliable alerts when the VM is in a stopped state, unless you create the alerting policy with MQL (Monitoring Query Language).
"metrics associated with TERMINATED or DELETED Google Cloud resources are not considered for metric-absence policies. This means you can't use metric-absence policies to test for TERMINATED or DELETED Google Cloud VMs."
https://cloud.google.com/monitoring/alerts/types-of-conditions#metric-absence
So, per the statement above, we cannot use a metric-absence policy for a stopped VM, as it goes to the TERMINATED state after it has been stopped for some time. The reason is that the instance's stop time is only accounted for when the instance becomes running again.
But when you configure the same condition with MQL, using the same set of metrics, metric-absence policies work without any issues.
Sample:
Instead of configuring the condition by selecting a resource and metric, go to the Query Editor and type the query below to get an alert when the development-environment VM is not in the running state for 3 minutes.
fetch gce_instance
| metric 'compute.googleapis.com/instance/uptime'
| filter (metadata.user_labels.env == 'dev')
| group_by 1m, [value_uptime_aggregate: aggregate(value.uptime)]
| every 1m
| absent_for 180s
Not sure whether this is a bug or not, but it is a limitation when you configure the alerting condition in the traditional way, and you can work around it by leveraging MQL.
The behavior you're describing is unusual.
I reproduced your case and created the exact alerting policy using the same metric, compute.googleapis.com/instance/uptime, with the same settings. I forwarded all alerts to my email.
Unfortunately, I wasn't able to reproduce this behavior. After playing with various settings (aggregation, absence time), I was still getting alerting emails.
Try setting up the alerting policy again. If your goal is just to monitor the state of the VM (responding or not), then you can use any other metric, such as CPU usage, which will be absent when the VM is off (or unresponsive).
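If you prefer to script that policy rather than click through the console, here is a rough sketch with the google-cloud-monitoring Python client, using a metric-absence condition on CPU utilization (the project ID and display names are placeholders):

from google.cloud import monitoring_v3

project = "my-project"
client = monitoring_v3.AlertPolicyServiceClient()
policy = monitoring_v3.AlertPolicy(
    display_name="vm-down (cpu metric absent)",
    combiner=monitoring_v3.AlertPolicy.ConditionCombiner.AND,
    conditions=[
        monitoring_v3.AlertPolicy.Condition(
            display_name="CPU utilization absent for 3 minutes",
            condition_absent=monitoring_v3.AlertPolicy.Condition.MetricAbsence(
                filter=(
                    'metric.type = "compute.googleapis.com/instance/cpu/utilization"'
                    ' AND resource.type = "gce_instance"'
                ),
                duration={"seconds": 180},
            ),
        )
    ],
)
created = client.create_alert_policy(name=f"projects/{project}", alert_policy=policy)
print("Created", created.name)

Keep in mind the metric-absence caveat for TERMINATED VMs quoted in the other answer; this sketch only covers the case where the metric stream stops while the resource still exists.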
Finally, you can try installing the monitoring agent on your VM, which will give you more metrics and thus more information about the machine.
Have a look at the how to manage alerting policies documentation, which may be useful to you. Additionally, this documentation describes alerting policy types and how to choose the appropriate one for your use case.
Ultimately, try creating another VM and setting up an alerting policy for it. If that doesn't work, your best option is to go to Google IssueTracker and file a new bug report; it will be handled by the product team, although this may take some (or a lot of) time depending on the issue.

how to get uptime of GCP VM instance

I need to know the uptime of my GCP VM instances (both Windows and Linux), and based on that time I need to stop the VM. I have not found any simple way to get the uptime of all my GCP VMs, which currently number around 100 and will keep increasing.
I went through the answer below, but it is not answered there either; I could not add a comment, so I had to ask a new question.
Get vm uptime data from stackdriver-agent in gcp?
In the Python code samples at the link below, there is no module for instance uptime; all that is available is creating an uptime check for service availability.
https://github.com/GoogleCloudPlatform/python-docs-samples
How can I get the uptime of all my GCP VM instances?
Assuming that you can adjust the process of starting your VMs, I think the solution below is viable:
When a VM is started, add a custom tag with the current timestamp (API reference).
Use this tag's value to determine the actual instance uptime.
I realize that it sounds overcomplicated, but I don't see any better OS-independent solution; a sketch of the idea follows.
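For illustration, here is what that could look like in Python with the google-cloud-compute client, using an instance label (which, unlike a plain network tag, carries a key/value pair); the label key and all names are placeholders:

from google.cloud import compute_v1
import time

LABEL = "started-at"  # hypothetical label key holding the epoch start time

def record_start_time(project: str, zone: str, name: str) -> None:
    # Call this right after starting the VM to stamp the start time.
    client = compute_v1.InstancesClient()
    instance = client.get(project=project, zone=zone, instance=name)
    labels = dict(instance.labels)
    labels[LABEL] = str(int(time.time()))
    client.set_labels(
        project=project, zone=zone, instance=name,
        instances_set_labels_request_resource=compute_v1.InstancesSetLabelsRequest(
            labels=labels,
            label_fingerprint=instance.label_fingerprint,
        ),
    )

def uptime_seconds(project: str, zone: str, name: str) -> int:
    # Read the stamped label back and compute elapsed uptime.
    client = compute_v1.InstancesClient()
    instance = client.get(project=project, zone=zone, instance=name)
    return int(time.time()) - int(instance.labels[LABEL])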
Update:
The feature you need has already been requested in Google's issue tracker. You can check the progress and/or "star" it here: https://issuetracker.google.com/issues/136105125
Note: the issue referenced above is marked as blocked by another, non-public issue.
Go to the GCP console.
Select Monitoring.
Click Uptime checks.
Click Create Uptime check.
For more info, check the document below:
https://cloud.google.com/monitoring/uptime-checks
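If you need to create uptime checks for many VMs, the console steps above can also be scripted; a minimal sketch with the google-cloud-monitoring Python client for an HTTP check (host and project ID are placeholders):

from google.cloud import monitoring_v3

project = "my-project"
client = monitoring_v3.UptimeCheckServiceClient()
config = monitoring_v3.UptimeCheckConfig(
    display_name="instance01-http",
    monitored_resource={
        "type": "uptime_url",
        "labels": {"project_id": project, "host": "instance01.example.com"},
    },
    http_check={"path": "/", "port": 80},
    timeout={"seconds": 10},
    period={"seconds": 60},
)
created = client.create_uptime_check_config(
    request={"parent": f"projects/{project}", "uptime_check_config": config}
)
print("Created", created.name)

Note that uptime checks test the reachability of an endpoint, not how long the instance has been up, so this only partially addresses the original question.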

Unable to understand GCP bill for Stackdriver Monitoring usage

We have implemented kube-state-metrics (by following the steps in this article, section 4.4.1 "Install monitoring components") on one of our Kubernetes clusters on GCP. This created 3 new deployments on our cluster: node-exporter, prometheus-k8s, and kube-state-metrics. After that, we were able to see all metrics inside Metrics Explorer with the prefix "external/prometheus/".
To check external metrics pricing, we referred to this link and calculated the price accordingly, but when we received the bill it was a shocking figure. GCP charged a large amount even though we hadn't added a single metric to a dashboard or set up monitoring for anything. From the ingested volume (around 1.38 GB/day), it looks like these monitoring tools do some background job (reading metrics at regular intervals) that consumed this volume and produced this bill.
We would like to understand how these kube-state-metrics monitoring components work. Do they automatically collect metrics data and increase the ingested volume (and the bill) this way, or is there some misconfiguration in the setup?
Any guidance on this would be really appreciated!
Thank you.
By default, when deployed, kube-state-metrics exposes several metrics for events across your cluster.
If you have a number of frequently-updating resources in your cluster, you may find that a lot of data is ingested into these metrics, which incurs high costs.
You need to configure which metrics you'd like to expose (for example, recent kube-state-metrics versions provide flags such as --resources and --metric-allowlist to restrict what is exported), and consult the documentation for your Kubernetes environment in order to avoid unexpectedly high costs.
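To see which Prometheus metrics are actually being ingested (and therefore billed), one quick option is to list the external metric descriptors with the google-cloud-monitoring Python client; the project ID is a placeholder:

from google.cloud import monitoring_v3

project = "my-project"
client = monitoring_v3.MetricServiceClient()
descriptors = client.list_metric_descriptors(
    request={
        "name": f"projects/{project}",
        # Only external metrics ingested via the Prometheus integration.
        "filter": 'metric.type = starts_with("external.googleapis.com/prometheus/")',
    }
)
for descriptor in descriptors:
    print(descriptor.type)

Comparing this list against the metrics you actually use is a reasonable first step before pruning what kube-state-metrics exports.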

How to extract an instance uptime based on incidents?

On Stackdriver, creating an Uptime Check gives you access to the Uptime Dashboard, which contains the uptime % of your service.
My problem is that uptime checks are restricted to HTTP/TCP checks. I have other services running, and those services report their health in different ways (say, for example, by a specific process running). I already have incident policies set up for these services, so if a service is not running I get notified.
Now I want to be able to look back and know how long the service was down for the last hour. Is there a way to do that?
There's no way to programmatically retrieve alerts at the moment, unfortunately. Many resource types expose uptime as a metric, though (e.g., instance/uptime on GCE instances) - could you pull those and do the math on them? Without knowing what resource types you're using, it's hard to give specific suggestions.
Aaron Sher, Stackdriver engineer
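As a sketch of that "do the math" approach with the google-cloud-monitoring Python client (the project ID is a placeholder): since instance/uptime is a DELTA metric, summing its points over the last hour approximates seconds of uptime, and the remainder of the hour approximates downtime.

import time
from google.cloud import monitoring_v3

project = "my-project"
client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)
results = client.list_time_series(
    request={
        "name": f"projects/{project}",
        "filter": 'metric.type = "compute.googleapis.com/instance/uptime"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    # Each point is seconds of uptime within its sampling window.
    up = sum(point.value.double_value for point in series.points)
    instance = series.resource.labels.get("instance_id", "unknown")
    print(f"instance {instance}: ~{up:.0f}s up, ~{3600 - up:.0f}s down in the last hour")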