How to create an alert policy for unknown custom metric in GCP - google-cloud-platform

Given the following alert policy in GCP (created with Terraform):
resource "google_monitoring_alert_policy" "latency_alert_policy" {
display_name = "Latency of 95th percentile more than 1 second"
combiner = "OR"
conditions {
display_name = "Latency of 95th percentile more than 1 second"
condition_threshold {
filter = "metric.type=\"custom.googleapis.com/http/server/requests/p95\" resource.type=\"k8s_pod\""
threshold_value = 1000
duration = "60s"
comparison = "COMPARISON_GT"
aggregations {
alignment_period = "60s"
per_series_aligner = "ALIGN_NEXT_OLDER"
cross_series_reducer = "REDUCE_MAX"
group_by_fields = [
"metric.label.\"uri\"",
"metric.label.\"method\"",
"metric.label.\"status\"",
"metadata.user_labels.\"app.kubernetes.io/name\"",
"metadata.user_labels.\"app.kubernetes.io/component\""
]
}
trigger {
count = 1
percent = 0
}
}
}
}
I get the following error (this is part of a Terraform project that also creates the cluster):
Error creating AlertPolicy: googleapi: Error 404: The metric referenced by the provided filter is unknown. Check the metric name and labels.
Now, this is a custom metric (emitted by a Spring Boot app with Micrometer), so it does not exist yet when the infrastructure is created. Does GCP have to know a metric before an alert can be created for it? That would mean a Spring Boot app has to be deployed on the cluster and sending metrics before this policy can be created?
Am I missing something... (like this should not be done in terraform, infrastructure)?

Interesting question. The 404 error occurs because the metric referenced by the filter was not found; the metric descriptor seems to be a prerequisite for the alert policy. I would create the metric descriptor first (you can use this as a reference) and then create the alerting policy.
This is one way to avoid the error. Please comment if it makes sense, and if you get it working like this, share it.

For reference (according to the Terraform docs, this can be referenced from the alert policy):
resource "google_monitoring_metric_descriptor" "p95_latency" {
description = ""
display_name = ""
type = "custom.googleapis.com/http/server/requests/p95"
metric_kind = "GAUGE"
value_type = "DOUBLE"
labels {
key = "status"
}
labels {
key = "uri"
}
labels {
key = "exception"
}
labels {
key = "method"
}
labels {
key = "outcome"
}
}
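
As a minimal sketch of "referencing" the descriptor from the alert policy: interpolating the descriptor's type attribute into the filter gives Terraform an implicit dependency, so the descriptor is created before the policy. This assumes the google_monitoring_metric_descriptor.p95_latency block above lives in the same configuration; the condition is a trimmed-down version of the one in the question, and whether the API accepts the policy right after the descriptor is created has not been verified here.

```hcl
resource "google_monitoring_alert_policy" "latency_alert_policy" {
  display_name = "Latency of 95th percentile more than 1 second"
  combiner     = "OR"

  conditions {
    display_name = "Latency of 95th percentile more than 1 second"

    condition_threshold {
      # Using the descriptor's type attribute (instead of a hard-coded string)
      # adds a dependency edge: descriptor first, then the alert policy.
      filter          = "metric.type=\"${google_monitoring_metric_descriptor.p95_latency.type}\" AND resource.type=\"k8s_pod\""
      threshold_value = 1000
      duration        = "60s"
      comparison      = "COMPARISON_GT"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_NEXT_OLDER"
      }
    }
  }
}
```

An explicit depends_on = [google_monitoring_metric_descriptor.p95_latency] on the policy would be an alternative if the filter string has to stay hard-coded.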

Related

Add environment based Multiple Notification Channel to GCP Alert Policy with Terraform Lookup

I'm trying to add multiple notification channels to a GCP Alert policy with terraform.
My issue is that I need to add different notification channels based on the production environment where they are deployed.
As long as I keep the notification channel unique, I can easily deploy in the following way.
Here is my variables.tf file:
locals {
notification_channel = {
DEV = "projects/[PROJECT_ID]/notificationChannels/[CHANNEL_ID]"
PRD = "projects/[PROJECT_ID]/notificationChannels/[CHANNEL_ID]"
}
}
Here is my main.tf file:
resource "google_monitoring_alert_policy" "alert_policy" {
display_name = "My Alert Policy"
combiner = "OR"
conditions {
display_name = "test condition"
condition_threshold {
filter = "metric.type=\"compute.googleapis.com/instance/disk/write_bytes_count\" AND resource.type=\"gce_instance\""
duration = "60s"
comparison = "COMPARISON_GT"
aggregations {
alignment_period = "60s"
per_series_aligner = "ALIGN_RATE"
}
}
}
user_labels = {
foo = "bar"
}
notification_channels = [lookup(local.notification_channel, terraform.workspace)]
}
My issue here happens when I try to map multiple notification channels instead of one per environment.
Something like:
locals {
notification_channel = {
DEV = ["projects/[PROJECT_ID]/notificationChannels/[CHANNEL_ID]", "projects/[PROJECT_ID]/notificationChannels/[CHANNEL_ID]" ]...
}
}
However, if I try it this way, Terraform reports: Inappropriate value for attribute "notification_channels": element 0: string.
Here's the documentation for:
Terraform lookup function
GCP Alert Policy
Could you help?
If I understood your question correctly, you only need to remove the square brackets:
notification_channels = lookup(local.notification_channel, terraform.workspace)
Since each value in the local map notification_channel is already a list, lookup returns a list for the workspace you are currently in, and wrapping it in another pair of brackets would produce a list of lists.
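
Put together, a minimal sketch of the map-of-lists approach could look like the following. The workspace names DEV and PRD and the placeholder channel IDs come from the question; the empty-list default passed to lookup is an added assumption so that an unknown workspace yields no channels instead of an error.

```hcl
locals {
  # One list of notification channel IDs per workspace (placeholders as in the question).
  notification_channel = {
    DEV = [
      "projects/[PROJECT_ID]/notificationChannels/[CHANNEL_ID]",
      "projects/[PROJECT_ID]/notificationChannels/[CHANNEL_ID]",
    ]
    PRD = [
      "projects/[PROJECT_ID]/notificationChannels/[CHANNEL_ID]",
    ]
  }
}

resource "google_monitoring_alert_policy" "alert_policy" {
  display_name = "My Alert Policy"
  combiner     = "OR"

  conditions {
    display_name = "test condition"

    condition_threshold {
      filter     = "metric.type=\"compute.googleapis.com/instance/disk/write_bytes_count\" AND resource.type=\"gce_instance\""
      duration   = "60s"
      comparison = "COMPARISON_GT"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_RATE"
      }
    }
  }

  # lookup already returns a list here, so no extra square brackets.
  # The third argument (assumption) is the fallback for unknown workspaces.
  notification_channels = lookup(local.notification_channel, terraform.workspace, [])
}
```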

GCP Alerting Policy to Alert on KMS Key Deletion Using Terraform

I am trying to alert on KMS Key deletions using terraform.
I have a log based metric, a policy and a notification channel to PagerDuty.
This all works, however, following the alert triggering it soon clears and there seems to be nothing I can do to stop this.
Here is my code:
resource "google_logging_metric" "logging_metric" {
name = "kms-key-pending-deletion"
description = "Logging metric used to alert on scheduled deletions of KMS keys"
filter = "resource.type=cloudkms_cryptokeyversion AND protoPayload.methodName=DestroyCryptoKeyVersion"
metric_descriptor {
metric_kind = "DELTA"
value_type = "INT64"
unit = "1"
display_name = "kms-key-pending-deletion-metric-descriptor"
}
}
resource "google_monitoring_notification_channel" "pagerduty_alerts" {
display_name = "pagerduty-notification-channel"
type = "pagerduty"
sensitive_labels {
service_key = var.token
}
}
resource "google_monitoring_alert_policy" "kms_key_deletion_alert_policy" {
display_name = "kms-key-deletion-alert-policy"
combiner = "OR"
notification_channels = [google_monitoring_notification_channel.pagerduty_alerts.name]
conditions {
display_name = "kms-key-deletion-alert-policy-conditions"
condition_threshold {
comparison = "COMPARISON_GT"
duration = "300s"
filter = "metric.type=\"logging.googleapis.com/user/kms-key-pending-deletion\" AND resource.type=\"global\""
threshold_value = "0"
}
}
documentation {
content = "Runbook: https://blah"
}
}
In the GCP GUI I can disable the option "Notify on incident closure" in the policy and it stops the alert from clearing.
However I cannot set this via terraform.
I have tried setting alert_strategy.auto_close to null and 0s but this did not work:
alert_strategy {
auto_close = "0s"
# auto_close = null
}
How do I keep the alert active and stop it from clearing when building the policy in terraform?
Am I using the correct resource type? - Should I be using cloudkms.cryptoKey.state that are in "DESTROY_SCHEDULED" state somehow?
For others looking for the answer to this:
The ability to keep an alert open and prevent it from closing automatically is missing from the API.
The issue is tracked here: https://issuetracker.google.com/issues/151052441?pli=1

GCP terraform - alerts module based on log metrics

As per subject, I have set up log based metrics for a platform in gcp i.e. firewall, audit, route etc. monitoring.
Now I need to setup alert policies tied to these log based metrics, which is easy enough to do manually in gcp.
However, I need to do it via Terraform, using this resource:
https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy#nested_alert_strategy
I might be missing something very simple, but I find this hard to understand: the alert strategy is apparently required, yet it does not seem to be supported?
I am also a bit confused about which kind of condition I should use to match my existing log-based metric.
This is my module so far. PS: I have tried using the same filter that I used to set up the log-based metric, as well as the name of the log-based filter:
resource "google_monitoring_alert_policy" "alert_policy" {
display_name = var.display_name
combiner = "OR"
conditions {
display_name = var.display_name
condition_matched_log {
filter = var.filter
#duration = "600s"
#comparison = "COMPARISON_GT"
#threshold_value = 1
}
}
user_labels = {
foo = "bar"
}
}
var.filter is:
resource.type="gce_route" AND (protoPayload.methodName:"compute.routes.delete" OR protoPayload.methodName:"compute.routes.insert")
I got this resolved in the end.
It turns out to be a common issue:
https://issuetracker.google.com/issues/143436657?pli=1
I had to add this to the filter parameter in my Terraform module, after the metric name: AND resource.type="global"
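
A rough sketch of what the resolved module might look like, assuming the fix means switching to a condition_threshold on the log-based metric. The metric name route-changes and the resource name route_change_alert are hypothetical, and the duration, comparison, and threshold values are simply the ones commented out in the question.

```hcl
variable "display_name" {
  type = string
}

resource "google_monitoring_alert_policy" "route_change_alert" {
  display_name = var.display_name
  combiner     = "OR"

  conditions {
    display_name = var.display_name

    condition_threshold {
      # "route-changes" is a hypothetical log-based metric name.
      # The trailing resource.type clause is the part the fix adds.
      filter          = "metric.type=\"logging.googleapis.com/user/route-changes\" AND resource.type=\"global\""
      duration        = "600s"
      comparison      = "COMPARISON_GT"
      threshold_value = 1
    }
  }

  user_labels = {
    foo = "bar"
  }
}
```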

Terraform Google provider, create log-based alerting policy

I need to create a log-based alerting policy via the Terraform Google Cloud provider:
https://cloud.google.com/logging/docs/alerting/monitoring-logs#lba
I checked the official Terraform documentation and found the 'google_monitoring_alert_policy' resource: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy
I couldn't find in this doc how to create a log-based alerting policy.
I can create an alerting policy of type 'Metrics' but not of type 'Logs'.
I use the latest version of the Terraform Google Cloud provider: https://registry.terraform.io/providers/hashicorp/google/latest
How can I create a log-based alerting policy with the Terraform Google provider?
Thanks in advance for your help.
The problem is solved in version 4.7.0 of the Google provider, which adds condition_matched_log. Here is a working example:
resource "google_monitoring_notification_channel" "email-me" {
display_name = "Email Me"
type = "email"
labels = {
email_address = "me@mycompany.com"
}
}
resource "google_monitoring_alert_policy" "workflows" {
display_name = "Workflows alert policy"
combiner = "OR"
conditions {
display_name = "Error condition"
condition_matched_log {
filter = "resource.type=\"workflows.googleapis.com/Workflow\" severity=ERROR"
}
}
notification_channels = [ google_monitoring_notification_channel.email-me.name ]
alert_strategy {
notification_rate_limit {
period = "300s"
}
}
}
Thanks Guillaume.
Yes, that's the way I solved the issue.
For now, there is no way to create a log-type alerting policy directly via Terraform.
The steps to solve this problem:
1. Create a log-based metric with the expected filter.
2. Create a metric-type alerting policy based on the previously created log-based metric.
resource "google_logging_metric" "my_log_metrics" {
project = var.project_id
name = "my-log-metric"
filter = "..."
description = "..."
metric_descriptor {
metric_kind = "..."
value_type = "..."
}
}
resource "google_monitoring_alert_policy" "my_policy" {
project = var.project_id
display_name = "my-policy"
combiner = "OR"
conditions {
display_name = "my-policy"
condition_threshold {
filter = "metric.type=\"logging.googleapis.com/user/my-log-metric\" AND resource.type=\"cloud_composer_environment\""
...
}
}
}
The format is logging.googleapis.com/user/<user metrics name>
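
As a sketch of that naming format, the metric type can be built from the google_logging_metric resource's name attribute rather than typed by hand, which also makes Terraform create the metric before the policy. The log filter, the cloud_run_revision resource type, and the threshold values below are illustrative assumptions, not values from the answer above.

```hcl
resource "google_logging_metric" "my_log_metrics" {
  name        = "my-log-metric"
  description = "Counts error logs (illustrative)"
  # Hypothetical filter; replace with the log filter you actually need.
  filter = "resource.type=\"cloud_run_revision\" AND severity>=ERROR"

  metric_descriptor {
    metric_kind = "DELTA"
    value_type  = "INT64"
  }
}

resource "google_monitoring_alert_policy" "my_policy" {
  display_name = "my-policy"
  combiner     = "OR"

  conditions {
    display_name = "my-policy"

    condition_threshold {
      # logging.googleapis.com/user/<metric name>, taken from the resource above.
      filter          = "metric.type=\"logging.googleapis.com/user/${google_logging_metric.my_log_metrics.name}\" AND resource.type=\"cloud_run_revision\""
      duration        = "600s"
      comparison      = "COMPARISON_GT"
      threshold_value = 1
    }
  }
}
```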
Look at this example (no notification channel, only the alert policy):
resource "google_monitoring_alert_policy" "alert_policy" {
display_name = "My Alert Policy"
combiner = "OR"
conditions {
display_name = "test condition"
condition_threshold {
filter = "metric.type=\"logging.googleapis.com/user/test-metrics\" AND resource.type=\"cloud_run_revision\""
duration = "600s"
comparison = "COMPARISON_GT"
threshold_value = 1
}
}
user_labels = {
foo = "bar"
}
}

How to create a slack notification channel in Google Cloud Platform with terraform

I'm trying to create a slack notification channel in GCP with terraform. I am able to create a channel with the code below, but it's missing "Team" and "Owner" attributes.
resource "google_monitoring_notification_channel" "default" {
display_name = "Test Slack Channel"
type = "slack"
enabled = "true"
labels = {
"channel_name" = "#testing"
"auth_token" = "<my_slack_app_token>"
}
}
The first channel in the screenshot below was created via the GUI and works fine. The second channel was created via Terraform and is unable to send notifications:
The Terraform registry does not mention these attributes. I have tried defining them in labels right after channel_name:
labels = {
"channel_name" = "#testing"
"team" = "<my_team>"
"owner" = "google_cloud_monitoring"
"auth_token" = "<my_slack_app_token>"
}
I got the following error:
Error creating NotificationChannel: googleapi: Error 400: Field "notification_channel.labels['owner']" is not allowed; labels must conform to the channel type's descriptor; permissible label keys for "slack" are: {"auth_token", "channel_name"}
Apparently, only channel_name and auth_token are valid labels.
What am I missing?
Slack needs the sensitive_labels block for tokens. There is an example in the docs:
resource "google_monitoring_notification_channel" "default" {
display_name = "Test Slack Channel"
type = "slack"
labels = {
"channel_name" = "#foobar"
}
sensitive_labels {
auth_token = "...."
}
}
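
To avoid hard-coding the token, one option is to pass it in through a variable marked as sensitive. This is only a sketch: the variable name slack_auth_token is hypothetical, and the sensitive argument on variables assumes Terraform 0.14 or later.

```hcl
# Hypothetical variable; supply the value via TF_VAR_slack_auth_token or a tfvars file.
variable "slack_auth_token" {
  type      = string
  sensitive = true # keeps the value out of plan/apply output
}

resource "google_monitoring_notification_channel" "default" {
  display_name = "Test Slack Channel"
  type         = "slack"

  labels = {
    channel_name = "#foobar"
  }

  # The token goes in sensitive_labels, not in labels.
  sensitive_labels {
    auth_token = var.slack_auth_token
  }
}
```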