GCP Alert Filters Don't Affect Open Incidents

I have an alert that I have configured to send email when the sum of executions of cloud functions that have finished in status other than 'error' or 'ok' is above 0 (grouped by the function name).
The way I defined the alert is shown in the condition screenshot, and the secondary aggregator is delta.
The problem is that once the incident is open, it looks like the filters don't matter any more: the incident stays open because it sees that the cloud function is triggered and finishes with any status (even an 'ok' status keeps it open, as long as it's triggered often enough).
At the moment the only solution I can think of is to define a log-based metric that does the counting itself, and then base the alert on that custom metric instead of the built-in one.
Is there something that I'm missing?
Edit:
Adding another image to show what I think might be the problem:
From the image above we can see that the graph won't go down to 0 but stays at 1, which is not how other, normal incidents behave.

According to the official documentation:
"Monitoring automatically closes an incident when it observes that the condition is no longer met or when 7 days have passed without an observation that the condition is still being met."
That made me think that there are times when the condition is no longer being evaluated in a way that would close the incident, which is confirmed here:
"If measurements are missing (for example, if there are no HTTP requests for a couple of minutes), the policy uses the last recorded value to evaluate conditions."
The lack of HTTP requests isn't a reason to close the incident, because the policy keeps using the last recorded value (the one that triggered it).
So using alerts on HTTP requests is fine, but you need to close the incidents yourself. Although I think it would be better to use a custom metric instead if you want incidents to close automatically.
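For illustration, here is a minimal sketch of that custom-metric approach using the google-cloud-logging Python client. The metric name and the log filter are placeholders; adapt the filter to whatever your function execution logs actually contain.

from google.cloud import logging

client = logging.Client()  # picks up the project from the environment

# Hypothetical counter metric: executions that finished in a status other
# than 'ok' or 'error'. The filter below is illustrative, not exact.
metric = client.metric(
    "bad_function_finishes",
    filter_=(
        'resource.type="cloud_function" '
        'AND textPayload:"finished with status" '
        'AND NOT textPayload:"\'ok\'" '
        'AND NOT textPayload:"\'error\'"'
    ),
    description="Cloud Function executions finishing in an unexpected status",
)
if not metric.exists():
    metric.create()

An alert condition on a metric like this only counts the problematic executions, which is what lets the incident close once they stop.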


Reduce alert noise in GCP stackdriver

We have set up alerts in our GCP environments. Basically, GCP Stackdriver raises alerts based on parameters we have configured, both at the infrastructure level and the application level.
The issue is that we get too many alerts while a problem remains unresolved. For example, if a compute engine instance is down and we are still investigating it, we keep getting alerts. I'm looking for help to reduce alert noise so that once we acknowledge an issue, the alert frequency is reduced until we resolve it (say, one mail every three hours rather than one every 10 minutes) or until the problem is fixed.
Posting this as an answer for better usability.
When the alert is triggered you will keep receiving notifications every 10 minutes or so until you acknowledge the incident.
Once you do, the notifications will stop, but the incident will be kept open until you close it.
You can also silence the incident; however, doing so will close any other incidents that were triggered by the same condition.
You may also have a look at the alerting behavior docs since they may prove useful in such cases.

GCP Alert notification is sending only once

I set up alert monitoring for a Pub/Sub subscription as shown below:
I was expecting this to fire every 2 minutes, since the condition is met throughout.
But I got the notification only one time. I also tried with duration 1 minute. Still no luck.
What am I doing wrong here?
Or my understanding of these terms may be wrong?
What I want is:
Every 2 minutes, when the count of un-acked messages is > x, trigger an alert.
Note: I just masked the filter field here, which is a subscription_id.
Your current monitoring setup is working as intended, since you only have a single time series and a single condition. As per the alert notification docs:
You can only receive multiple notifications if any of the following are true:
All conditions are met: When all conditions are met, then for each time series that results in a condition being met, the policy sends a notification and creates an incident. For example, if you have a policy with two conditions and each condition is monitoring one time series, then when the policy is triggered, you receive two notifications and you see two incidents.
Any condition is met: The policy sends a notification each time a new combination of conditions is met. For example, assume that ConditionA is met, that an incident opens, and that a notification is sent. If the incident is still open when a subsequent measurement meets both ConditionA and ConditionB, then another notification is sent.
If you create the policy by using the Google Cloud Console, then the default behavior is to send a notification when the condition is met.
Lastly, the purpose of "Period" is just to increase the number of data points in the chart; it is not related to triggering the notification multiple times while the value remains above the threshold. So it is not possible to send continuous notifications until the monitored data drops below the threshold.

GCP Logs-Based Monitoring: Trigger an alert when no logs are received

I have an application that I'm setting up logs-based monitoring for. The application will log whenever it completes a certain task. I want to ensure that the application completes this at least once every 6 hours.
I have tried to replicate this rule by configuring monitoring to fire an alert when the metric stays below 1 for the given amount of time.
Unfortunately, when the logs-based metric doesn't receive any logs, it appears to be treated as "no data" instead of a value of 0.
Is it possible to treat segments when no logs are received as a 0 so that the alert will fire?
Screenshot of my metric graph:
Screenshot of alert definition:
You can see that we receive a log for one time frame, but right afterwards the line disappears and an alert isn't triggered.
Try using absent_for with an MQL-based alert.
The absent_for table operation generates a table with two value columns, active and signal. The active column is true when there is data missing from the table input and false otherwise. This is useful for creating a condition query to be used to alert on the absence of inputs.
Example:
fetch gce_instance :: compute.googleapis.com/instance/cpu/usage_time
| absent_for 8h
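For completeness, here is a rough sketch of wiring an MQL condition like that into an alerting policy with the google-cloud-monitoring Python client. The project ID, notification channel, resource type, and the log-based metric name in the query are placeholders, and the absent_for window is set to the 6 hours from the question.

from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()
project = "projects/my-project-id"  # placeholder

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Task-completed logs absent for 6h",
    condition_monitoring_query_language=(
        monitoring_v3.AlertPolicy.Condition.MonitoringQueryLanguageCondition(
            query=(
                "fetch gce_instance"
                " :: logging.googleapis.com/user/task_completed"  # placeholder metric
                " | absent_for 6h"
            ),
        )
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="No task-completed logs in 6 hours",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
    notification_channels=["projects/my-project-id/notificationChannels/123"],  # placeholder
)

created = client.create_alert_policy(name=project, alert_policy=policy)
print("Created policy:", created.name)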

AWS lambda execution fails only first time I run it with 'customer function error'

I trigger a lambda function via API gateway and everything works perfectly with the one exception that the very first time I trigger it on a given day it fails.
Strangely, the lambda function logs don't show any errors. I get my usual START log statement and then the request and context of the trigger, then after 5s, it ends unexpectedly.
When I look into the API gateway logs this is the error it returns:
Lambda execution failed with status 200 due to customer function error: 2018-12-10T11:00:31.208Z cc233168-fc9n-11fc-a05a-577bb4sd2b2ccc Task timed out after 5.01 seconds.
Has anyone encountered a similar problem? What is customer function error and how may I resolve this?
Without knowing much about the background code you are using, I would term this a cold start. A cold start happens on the first request after your function has not been called for a long time. Notice the error message says it timed out after 5.01 seconds, which is the timeout currently configured for the function; you can increase that timeout.
Alternatively, you could reduce the impact of cold starts by reducing their length (reference):
authoring your Lambda functions in a language that doesn't incur a high cold start time, i.e. Node.js, Python, or Go
choosing a higher memory setting for functions on the critical path of handling user requests (i.e. anything that the user would have to wait for a response from, including intermediate APIs)
optimizing your function's dependencies and package size
You can also explore scheduling a cron job through CloudWatch at a regular interval to call your API with a ping so it stays warm.
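As a rough sketch of the first suggestion, the timeout and memory can be raised with boto3 (the function name and values here are placeholders); the same change can be made in the Lambda console under the function's general configuration.

import boto3

lambda_client = boto3.client("lambda")

# Placeholder function name; raise the timeout past the 5 seconds being hit
# and give the function more memory (which also means more CPU, so shorter
# cold starts).
lambda_client.update_function_configuration(
    FunctionName="my-api-handler",
    Timeout=30,
    MemorySize=512,
)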
Adding to Yash's answer:
I've only seen Lambda execution failed with status 200 in API Gateway execution logs, though it may manifest in other ways; make sure you have execution logging enabled for the endpoint. If you didn't already have it enabled, you'll need to wait for the problem to manifest again.
You can verify it's a cold start problem as follows:
In the log entry with the error, grab the #logStream value and the timestamp of the event; the stream ID will be a long string of alphanumerics like a4f8115980dc83a511eeedc493a78741
Open the log group for that endpoint's execution log -> find the log stream with the identifier you just grabbed
Narrow the date/time range to a window around the time where the event occurred
If you chose a narrow window and it is a cold start problem, I would expect the offending request to be the first one in the list. Click "There are older events to load. Load more." at the top of the list.
You should now see a gap of time between the last request received and the offending request.
In my case the error says connection reset by peer, which leads me to think it's behaving as though a virtual machine were put to sleep and then awoken, in the sense that it believes TCP connections it previously had open are still valid.
In the short term the solution we're going with is to implement a retry strategy.
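As an illustration of that stopgap, a retry wrapper around the API Gateway call could look roughly like this with requests and urllib3 (the URL, methods, and status codes are placeholders; allowed_methods needs urllib3 >= 1.26):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=3,
    backoff_factor=1,              # waits 1s, 2s, 4s between attempts
    status_forcelist=[502, 504],   # placeholder gateway-side failure statuses
    allowed_methods=["GET", "POST"],
)
session.mount("https://", HTTPAdapter(max_retries=retries))

# Placeholder endpoint for the API Gateway stage in question.
response = session.post("https://example.execute-api.us-east-1.amazonaws.com/prod/task", json={})
response.raise_for_status()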
Besides the cold-start problem, there's another potential aspect of this problem: your API Gateway access log format.
Do the following:
Find the access log entries that correspond to the offending request in the execution log.
Is the HTTP status == 502?
502s in the API Gateway access log usually (always?) indicate the Lambda responded with malformed JSON.
The most obvious reason for it returning malformed JSON is a bug in your code. One of the less obvious reasons: a mistake in the access log format.
If you suspect that's the case, look for the following:
Quoted fields that shouldn't be; e.g. $context.error.messageString
Un-quoted fields that should be. A common idiom is to leave numeric fields un-quoted because it makes insights queries like this work: | filter #status >= 500. As convenient as that is, if the field isn't guaranteed to produce a numeric result then the JSON response will be malformed.
Trailing commas in {} bodies
Here's the documentation for many of the context variables, though one thing to keep in mind: the context variables that are available differ between the different API Gateway endpoint types (lambda, websocket, etc).
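As a small, made-up illustration of the quoting pitfall: $context.error.messageString already expands to a quoted string, so wrapping it in quotes again in the access log format produces a log line that is not valid JSON.

import json

# Made-up expansions of an access log line; messageString already carries quotes.
ok_line = '{ "status": 502, "error": "Internal server error" }'
bad_line = '{ "status": 502, "error": ""Internal server error"" }'  # quoted twice

json.loads(ok_line)  # parses fine
try:
    json.loads(bad_line)
except json.JSONDecodeError as exc:
    print("malformed access log line:", exc)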

Avoiding INSUFFICIENT DATA in Cloudwatch?

I have alarms set up to tell me when my load balancers are throwing 5xxs using the HTTPCode_Backend_5XX metric with the sum statistic. The issue is that sum registers 0 as no data points, so when no 5xxs are thrown, the alarm is treated as insufficient data. This is especially frustrating, because I have SNS setup to notify me whenever we get too many 5xxs (alarm state) and whenever things go back to normal. Annoyingly, 0 5xxs means we're in INSUFFICIENT DATA status, but 1 5xx means we're in OK status, so 1 5xx triggers everyone getting notified that stuff is OK. Is there any way around this? Ideally, I'd like to just have 0 of anything show up as a zero data point instead of no data at all (insufficient data).
As of March 2017, you can treat missing data as acceptable. This will prevent the alarm from being marked as INSUFFICIENT.
You can also set this in CloudFormation using the TreatMissingData property. For example:
TreatMissingData: notBreaching
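The equivalent setting when creating the alarm through boto3 might look roughly like this; the alarm name, dimensions, threshold, and SNS topic are placeholders modelled on the question's HTTPCode_Backend_5XX alarm.

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="backend-5xx",
    Namespace="AWS/ELB",
    MetricName="HTTPCode_Backend_5XX",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-elb"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=10,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",  # periods with no 5XX datapoints count as OK
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
    OKActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)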
We had a similar issue for some of our alarms. You can actually avoid this behaviour with some work, if you really want to deal with the overhead.
What we have done is, instead of sending SNS notifications directly to e-mail, we created a Lambda function and trigger it from the notification in the SNS topic.
This way you have more control over the actions you take once an alarm fires, since the notification payload also provides the old state value.
The good news is, there is already a lambda template to get started.
https://aws.amazon.com/blogs/aws/new-slack-integration-blueprints-for-aws-lambda/
Just pick the one that is designed to send cloudwatch alarms to slack. You can then modify the code as you wish, either dismiss the slack part and just use emails, or keep it with slack. (which is what we did and it works like a charm)
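A minimal sketch of that Lambda-in-the-middle idea, assuming the alarm notification arrives via SNS (the CloudWatch alarm message includes OldStateValue and NewStateValue); forward_to_email is a placeholder for however you actually deliver the mail.

import json

def handler(event, context):
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        old_state = message.get("OldStateValue")
        new_state = message.get("NewStateValue")

        # Only announce OK when recovering from a real alarm, not from INSUFFICIENT_DATA.
        if new_state == "OK" and old_state != "ALARM":
            continue

        forward_to_email(message["AlarmName"], old_state, new_state)

def forward_to_email(alarm_name, old_state, new_state):
    # Placeholder delivery; swap in SES, another SNS topic, Slack, etc.
    print(f"{alarm_name}: {old_state} -> {new_state}")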
I asked for this in the AWS forums two years ago :-(
https://forums.aws.amazon.com/thread.jspa?threadID=153753&tstart=0
Unfortunately you cannot create notifications based on specific state changes (in your case you want a notification when state changes from ALARM to OK, but not when state changes from INSUFFICIENT to OK). I can only suggest that you also ask for it and hopefully it will eventually be added.
For metrics that are often in the INSUFFICIENT state I generally just create notifications for the ALARM state and don't notify on OK for these metrics - if I want to confirm that things are OK I use the AWS mobile app to check whether they have resolved.