Stackdriver Alerts for "Decreases By" Condition Misfiring - google-cloud-platform

I have a custom metric being logged to Stackdriver from a service running in GKE. This custom metric is sort of a load factor for the service. I want to get notified if this load drops by 10% over 5 minutes. This seems pretty straightforward via the UI:
However, when I set up this alert I begin receiving endless notifications such as:
ALRT [alert name] on [project-name] decreasing by -0.116%
ALRT [alert name] on [project-name] decreasing by 0.207%
...
One alert for each trend line as configured, but each "decreases by" level is well below the 10% threshold that I've set. I have 26 instances of this service and this is resulting in my phone getting blown up with texts every few minutes.
I've also tried setting up conditions for individual series in this metric and the same error occurs: alerts are sent when the change is < 10%.
What's the correct method for configuring the "decreases by" condition?

Try changing the "Condition triggers if" setting to "Percent of time series violates". I have no proof, but I would assume that otherwise the "Number of time series violates" setting might be what triggers an alert for each individual series.
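For reference, this setting corresponds to the condition's trigger block in the alert-policy JSON (field names per the Cloud Monitoring AlertPolicy API; the values are only illustrative, not a tested configuration):
"trigger": { "percent": 100 }  // "Percent of time series violates": all series must cross the threshold
"trigger": { "count": 1 }      // "Number of time series violates": a single violating series is enough
With 26 instances of the service, a per-series trigger could produce one notification per violating series.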

Related

GCP alert: Only alerting when threshold is violated for multiple measurement periods

In GCP, I have an alerting policy on database CPU and memory usage. For example, if CPU is over 50% over a 1m period, the alert fires.
The alert is kind of noisy. With other systems, I've been able to alert only if the threshold is violated multiple times, e.g.
If the threshold is violated for 2 consecutive minutes.
If over a 5 minute period, the threshold is violated in 3 of those minutes.
(Note: I don't want to simply change my alignment period to 2 minutes.)
There are a couple things I've seen in the GCP alert configuration that might help here:
Change the "trigger"
UI: "Alert trigger: Number of time series violates", "Minimum number of time series in violation: 2".
JSON: "trigger": {"count": 2}
Change the "retest window"
UI: "Advanced Options" → "Retest window: 2m"
JSON: "duration": "120s"
But I can't figure out exactly how these work. Can these be used to achieve the goal?
The retest window option is useful in this scenario, I think. I had a similar requirement: I set up a GCP alert policy for when database CPU utilisation breaches 70% over a 5-minute rolling window. If the breach clears within 5 minutes it won't alert, but if it persists for more than 10 minutes it triggers the alert. I configured a retest window of 10 minutes to achieve this.
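For reference, a sketch of how the two settings from the question combine in the condition JSON (field names per the Cloud Monitoring API; the metric filter and values are only illustrative, and the // comments are not valid JSON):
"conditionThreshold": {
  "filter": "metric.type = \"cloudsql.googleapis.com/database/cpu/utilization\"",
  "comparison": "COMPARISON_GT",
  "thresholdValue": 0.5,
  "duration": "120s",        // retest window: the threshold must be violated for 2 consecutive minutes
  "trigger": { "count": 2 }  // at least 2 time series must be in violation
}
Note that duration requires the violation to be continuous for the whole window, so it covers the "2 consecutive minutes" case but not directly the "violated in 3 of 5 minutes" case.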

Metric for Number of unacknowledged messages older than 20 minutes

I am trying to set up alerts on Pub/Sub in GCP that monitor the number of old messages in a queue, specifically the number of unacknowledged messages older than 20 minutes.
I want an alert based on both metrics, because the number of unacknowledged messages alone could shoot up on a sudden push of a huge number of messages, while using only the oldest unacknowledged message age would fire the alert for outlier messages that get stuck in the queue (e.g. badly formatted messages).
I've tried to combine both metrics but couldn't figure out how to filter on one of them:
fetch pubsub_subscription |
{
t_0: metric 'pubsub.googleapis.com/subscription/num_undelivered_messages';
t_1: metric 'pubsub.googleapis.com/subscription/oldest_unacked_message_age'
}
| outer_join 0 # how to filter now on oldest_unacked_message_age > 20 minutes and select num_undelivered_messages
Also, from my understanding of the Cloud Pub/Sub metrics, I think this won't work anyway, because each metric is a single time-series value per subscription and carries no information about individual messages (correct me if I am wrong).
I've also tried to find a single metric that covers both, but couldn't find one either.
You can create an alert on undelivered messages in Google Cloud Monitoring. Select the Pub/Sub Subscription resource type, and then you can set a filter, for example on the response_code. You can also create a new chart based on your needs.
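If you still want to combine the two metrics in a single MQL condition query, a sketch along the lines of the original attempt could look like the following. The join/filter details and the 100-message threshold are assumptions on my part (not a tested query), and it still only works at the per-subscription level, not per message:
fetch pubsub_subscription
| { t_0: metric 'pubsub.googleapis.com/subscription/num_undelivered_messages';
    t_1: metric 'pubsub.googleapis.com/subscription/oldest_unacked_message_age'
         # keep only subscriptions whose oldest unacked message is older than 20 minutes
         | filter val() > 1200 's' }
| join
# alert when such a subscription also has a large backlog (100 is an arbitrary example)
| condition val(0) > 100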

GCP Alert notification is sending only once

I set up alert monitoring for a Pub/Sub subscription like below:
I was expecting this to fire every 2 minutes, since the condition is met throughout.
But I got the notification only one time. I also tried with duration 1 minute. Still no luck.
What am I doing wrong here?
Or my understanding of these terms may be wrong?
What I want is:
Every 2 minutes, when the count of un-acked messages is > x, trigger an alert.
Note: I just masked the filter field here, which is a subscription_id.
Your current monitoring setup is working as intended, since you only have a single time series and a single condition. As per the alert notification docs:
You can only receive multiple notifications if any of the following are true:

All conditions are met: When all conditions are met, then for each time series that results in a condition being met, the policy sends a notification and creates an incident. For example, if you have a policy with two conditions and each condition is monitoring one time series, then when the policy is triggered, you receive two notifications and you see two incidents.

Any condition is met: The policy sends a notification each time a new combination of conditions is met. For example, assume that ConditionA is met, that an incident opens, and that a notification is sent. If the incident is still open when a subsequent measurement meets both ConditionA and ConditionB, then another notification is sent.

If you create the policy by using the Google Cloud Console, then the default behavior is to send a notification when the condition is met.
Lastly, the purpose of "Period" is just to control the number of data points in the chart; it is not related to triggering a notification repeatedly while the value stays above the threshold. Thus it is not possible to send continuous notifications until the monitored data drops back below the threshold.
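For what it's worth, the "all conditions" / "any condition" behaviour quoted above corresponds to the policy's combiner field in the AlertPolicy JSON (assumed mapping, with the condition bodies omitted):
"combiner": "AND"  // "All conditions are met": one notification and incident per violating time series
"combiner": "OR"   // "Any condition is met": a notification for each new combination of violated conditions
With a single condition on a single time series, as in this question, neither setting changes anything: the policy still notifies once when the incident opens.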

GCP Alert Filters Don't Affect Open Incidents

I have configured an alert to send an email when the sum of executions of Cloud Functions that finished with a status other than 'error' or 'ok' is above 0 (grouped by the function name).
The way I defined the alert is:
And the secondary aggregator is delta.
The problem is that once the alert is open, it looks like the filters don't matter any more, and the alert stays open because it sees that the cloud function is triggered and finishes with any status (even 'ok' keeps it open as long as it's triggered enough).
At the moment the only solution I can think of is to define a log-based metric that does this counting itself, and then base the alert on that custom metric instead of on the built-in one.
Is there something that I'm missing?
Edit:
Adding another image to show what I think might be the problem:
From the image above we can see that the graph won't go down to 0 but stays at 1, which is not the way other, normal incidents behave.
According to the official documentation:
"Monitoring automatically closes an incident when it observes that the condition is no longer met or when 7 days have passed without an observation that the condition is still being met."
That made me think that there are cases where the condition never gets re-evaluated in a way that closes the incident. This is confirmed here:
"If measurements are missing (for example, if there are no HTTP requests for a couple of minutes), the policy uses the last recorded value to evaluate conditions."
The lack of HTTP requests isn't a reason to close the incident, because the policy keeps using the last recorded value (the one that triggered the alert).
So using alerts on HTTP requests is fine, but you need to close the incidents yourself. I think it would be better to use a custom metric instead if you want them to close automatically.
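If you do go the custom log-based metric route, a minimal sketch of the metric definition (a LogMetric body for the Logging API) could look like this. The filter string is a guess at the Cloud Functions log format for completions that are neither 'ok' nor 'error' and would need to be checked against your actual logs:
{
  "name": "function_executions_bad_status",
  "description": "Cloud Function executions that finished with a status other than 'ok' or 'error'",
  "filter": "resource.type=\"cloud_function\" AND textPayload:\"finished with status\" AND NOT textPayload:\"'ok'\" AND NOT textPayload:\"'error'\""
}
The alert can then be based on logging.googleapis.com/user/function_executions_bad_status instead of the built-in execution count metric.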

GCP Logs-Based Monitoring: Trigger an alert when no logs are received

I have an application that I'm setting up logs-based monitoring for. The application will log whenever it completes a certain task. I want to ensure that the application completes this at least once every 6 hours.
I have tried to replicate this rule by configuring monitoring to fire an alert when the metric stays below 1 for the given amount of time.
Unfortunately, when the logs-based metric doesn't receive any logs, it appears to be treated as "no data" instead of a value of 0.
Is it possible to treat segments when no logs are received as a 0 so that the alert will fire?
Screenshot of my metric graph:
Screenshot of alert definition:
You can see that we receive a log for one time frame, but right afterwards the line disappears and an alert isn't triggered.
Try using absent_for and an MQL-based alert.
The absent_for table operation generates a table with two value columns, active and signal. The active column is true when there is data missing from the table input and false otherwise. This is useful for creating a condition query to be used to alert on the absence of inputs.
Example:
fetch gce_instance :: compute.googleapis.com/instance/cpu/usage_time
| absent_for 8h
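Adapted to the question's log-based metric, the condition query could look roughly like this; completed_task_count is a placeholder name for your metric, and global is an assumption about the monitored-resource type your logs are written against (it may be gce_instance, k8s_container, etc. instead):
fetch global :: logging.googleapis.com/user/completed_task_count
| absent_for 6h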