Reduce alert noise in GCP stackdriver - google-cloud-platform

We have set up alerts in our GCP environments. Basically, GCP Stackdriver raises alerts based on certain parameters we configured (both at the infrastructure level and the application level).
The issue is that we get too many alerts when a problem is not resolved quickly. For example, if a Compute Engine instance is down and we are still investigating, we keep receiving alerts. I am looking for help reducing alert noise so that once we acknowledge an issue, the alert frequency drops until we resolve it (say, one email every three hours rather than one every 10 minutes, or none once the problem is fixed).

Posting this as an answer for better usability.
When the alert is triggered, you will receive notifications every 10 minutes or so until you acknowledge the incident.
Once you do, notifications stop, but the incident stays open until you close it.
You can also silence the incident; however, that may also close other incidents that were triggered by the same condition as this one.
You may also have a look at the alerting behavior docs since they may prove useful in such cases.
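If acknowledging by hand is not enough, the Monitoring API also exposes snoozes (projects.snoozes.create), which mute notifications for the listed policies during a time interval. A minimal sketch against the REST endpoint, assuming a hypothetical alert policy ID and a three-hour mute window (adjust both to your setup):

import datetime

import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, project_id = google.auth.default(
    scopes=["https://www.googleapis.com/auth/monitoring"]
)
session = AuthorizedSession(credentials)

# Hypothetical policy ID; use the policy behind the noisy incident.
policy = f"projects/{project_id}/alertPolicies/1234567890"
now = datetime.datetime.now(datetime.timezone.utc)

body = {
    "displayName": "Muted while we investigate",
    "criteria": {"policies": [policy]},
    "interval": {
        "startTime": now.isoformat(),
        "endTime": (now + datetime.timedelta(hours=3)).isoformat(),
    },
}
response = session.post(
    f"https://monitoring.googleapis.com/v3/projects/{project_id}/snoozes",
    json=body,
)
response.raise_for_status()
print(response.json()["name"])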

Related

Cloud Run: 429: The request was aborted because there was no available instance

We (as a company) experience large spikes every day. We use Pub/Sub -> Cloud Run combination.
The issue we experience is that when high traffic hits, Pub/Sub tries to push messages to Cloud Run all at the same time, without any flow control. The result?
429: The request was aborted because there was no available instance.
Although this is marked as a warning, every 4xx HTTP response causes the message to be redelivered.
Messages therefore go back to the queue and wait. If a message repeats this process and the instances are still busy, Cloud Run returns 429 again and the message is sent back to the queue. This repeats x times (depending on the value we set for Maximum delivery attempts), after which the message goes to the dead-letter queue.
We want to avoid this and ideally not get any 429s at all, so that messages do not travel back and forth and do not end up in the dead-letter subscription. They are not application errors we want to keep there, just a symptom of Pub/Sub not controlling the flow or coordinating with Cloud Run.
Neither Pub/Sub nor a push subscription (which is required for Cloud Run) has any flow-control feature.
Is there any way to control how many messages are sent to Cloud Run, to avoid getting the 429 response? And why does Pub/Sub even try to deliver when it is obvious that Cloud Run has hit its instance limit? Ideally the messages would stay in the queue until instances free up.
Most answers would probably suggest increasing the instance limit. We have already set it to 1000. That does not scale: even if we raised the limit to 1500, a big enough spike would push us past it and we would get 429s again.
The only option I can think of is some kind of flow control. So far we have read about Cloud Tasks, but we are not sure whether it can help us. Ideally we don't want to introduce a new service, but if necessary, we will.
Thank you for all your tips and time! :)
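One partial mitigation is to lengthen the push subscription's retry backoff, so that redelivered messages wait before hitting Cloud Run again instead of retrying immediately. A rough sketch with the google-cloud-pubsub client; the subscription name is illustrative:

from google.cloud import pubsub_v1
from google.protobuf import duration_pb2, field_mask_pb2

subscriber = pubsub_v1.SubscriberClient()

# Illustrative name; use your existing push subscription.
subscription = pubsub_v1.types.Subscription(
    name="projects/my-project/subscriptions/my-push-sub",
    retry_policy=pubsub_v1.types.RetryPolicy(
        minimum_backoff=duration_pb2.Duration(seconds=60),   # wait at least 60s before redelivery
        maximum_backoff=duration_pb2.Duration(seconds=600),  # cap the backoff at 10 minutes
    ),
)

# Only touch the retry policy; leave the rest of the subscription as-is.
subscriber.update_subscription(
    request={
        "subscription": subscription,
        "update_mask": field_mask_pb2.FieldMask(paths=["retry_policy"]),
    }
)

This does not cap concurrent deliveries the way a real flow-control layer such as Cloud Tasks would, but it stops the immediate 429 ping-pong between Pub/Sub and Cloud Run.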

Alert Policies keep incidents open after fix the uptime checks

Since 9 June 2021 I have been experiencing a problem with GCP Alert Policies: after the uptime checks return to OK status, the Alert Policy stays triggered as active.
The alerts were configured a while ago and the uptime checks all appear green, yet I have had 7 incidents open since that date.
Is anybody else experiencing the same problem?
Your case seems to be the one described below; if so, it is expected behavior, and it causes your incidents to be closed after 7 days.
Partial metric data: Missing or delayed metric data can result in policies not alerting and incidents not closing. Delays in data arriving from third-party cloud providers can be as high as 30 minutes, with 5-15 minute delays being the most common. A lengthy delay—longer than the duration window—can cause conditions to enter an "unknown" state. When the data finally arrives, Cloud Monitoring might have lost some of the recent history of the conditions. Later inspection of the time-series data might not reveal this problem because there is no evidence of delays once the data arrives
It sometimes happens that, in the event of an outage longer than 30 minutes (as mentioned above), your alerting policy enters this "unknown" state: metric reporting drops out entirely (disappears), and monitoring loses track of the condition's history. Once things recover, the policy by default keeps the last readable value; since metric reporting had stopped completely, the tool treats it as a null value, which is translated to 0.000.
This unknown-state behavior makes the tool "observe no changes" even though the metrics are reporting again at a normal state and pace, which brings in the "7 days without observable change" rule you can read about under Managing incidents: incidents are automatically closed if the system observes that the condition stopped being met, or if 7 days pass without an observation that the condition continues to be met.
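If waiting 7 days is too long, newer versions of the alerting API let you shorten this window through the policy's alert strategy (the autoClose duration). A small sketch with the google-cloud-monitoring client, using a hypothetical policy ID:

import datetime

from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()

# Hypothetical policy ID; find yours with client.list_alert_policies(name="projects/my-project").
name = "projects/my-project/alertPolicies/1234567890"
policy = client.get_alert_policy(name=name)

# Close incidents automatically after one day without data meeting the condition.
policy.alert_strategy.auto_close = datetime.timedelta(days=1)
client.update_alert_policy(alert_policy=policy)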

AWS CloudWatch: Repeat Alert Notification after 24h of Alertstate

I created some AWS CloudWatch alarms, typically with a period of 1 hour / 1 datapoint. When an alarm fires, our service team gets notified. During a "normal" workday someone takes care of it and does the work of resetting some programs etc. But it also happens that nobody has the time or awareness to deal with it, and the alarm stays in the ALARM state.
Now I want to repeat the notification if there hasn't been any state change in the past 24 hours. Is that possible? I haven't found an "easy" answer yet.
Thx!
EDIT:
Added a "daily_occurence_alert" which is controlled by eventrules / time control. An additional alert for each observed Alert combined with an AND serves good.
It is a workaround, not a solution. Hope this feature will be added as a standard in future.
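For reference, a rough sketch of what the scheduled part can look like: an EventBridge rule invokes a Lambda once an hour, and the function re-notifies for anything that has been stuck in ALARM for more than 24 hours (topic ARN and names are illustrative):

import datetime
import os

import boto3

cloudwatch = boto3.client("cloudwatch")
sns = boto3.client("sns")

# Illustrative topic; set it as an environment variable on the function.
TOPIC_ARN = os.environ["REMINDER_TOPIC_ARN"]


def handler(event, context):
    """Re-notify for alarms that have been in ALARM for more than 24 hours."""
    cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(hours=24)
    paginator = cloudwatch.get_paginator("describe_alarms")
    for page in paginator.paginate(StateValue="ALARM"):
        for alarm in page["MetricAlarms"]:
            if alarm["StateUpdatedTimestamp"] < cutoff:
                sns.publish(
                    TopicArn=TOPIC_ARN,
                    Subject=f"Still in ALARM: {alarm['AlarmName']}",
                    Message=alarm["StateReason"],
                )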

Avoiding INSUFFICIENT DATA in Cloudwatch?

I have alarms set up to tell me when my load balancers are throwing 5xxs, using the HTTPCode_Backend_5XX metric with the Sum statistic. The issue is that Sum registers 0 as no data points, so when no 5xxs are thrown, the alarm is treated as insufficient data. This is especially frustrating because I have SNS set up to notify me whenever we get too many 5xxs (alarm state) and whenever things go back to normal. Annoyingly, 0 5xxs means we're in INSUFFICIENT_DATA status, but 1 5xx means we're in OK status, so a single 5xx triggers everyone getting notified that stuff is OK. Is there any way around this? Ideally, I'd just like 0 of anything to show up as a zero data point instead of no data at all (insufficient data).
As of March 2017, you can treat missing data as acceptable. This prevents the alarm from going into the INSUFFICIENT_DATA state.
You can also set this in CloudFormation using the TreatMissingData property. For example:
TreatMissingData: notBreaching
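The same setting is available when creating the alarm with the API, for example via boto3; a small sketch (load balancer name, threshold and topic ARNs are illustrative):

import boto3

cloudwatch = boto3.client("cloudwatch")

# HTTPCode_Backend_5XX is reported in the AWS/ELB namespace for classic load balancers.
cloudwatch.put_metric_alarm(
    AlarmName="backend-5xx",
    Namespace="AWS/ELB",
    MetricName="HTTPCode_Backend_5XX",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "my-load-balancer"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=10,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",  # no data points means no 5xxs, so stay in OK
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],
    OKActions=["arn:aws:sns:us-east-1:123456789012:alerts"],
)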
We had a similar issue for some of our alarms. You can actually avoid this behaviour with some work, if you really want to deal with the overhead.
What we did, instead of sending SNS notifications directly to email, was create a Lambda function and trigger it whenever a notification lands on the SNS topic.
This way you have more control over the actions taken once an alarm fires, since the alarm message also gives you the old state value.
The good news is, there is already a lambda template to get started.
https://aws.amazon.com/blogs/aws/new-slack-integration-blueprints-for-aws-lambda/
Just pick the one that is designed to send CloudWatch alarms to Slack. You can then modify the code as you wish: either drop the Slack part and just use email, or keep Slack (which is what we did, and it works like a charm).
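A stripped-down sketch of such a function: it reads the state transition out of the SNS message and stays silent when an alarm merely moves from INSUFFICIENT_DATA to OK (the notify() helper is a placeholder for whatever email or Slack delivery you use):

import json


def handler(event, context):
    """Forward CloudWatch alarm notifications, skipping INSUFFICIENT_DATA -> OK."""
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        old_state = message["OldStateValue"]
        new_state = message["NewStateValue"]

        if new_state == "OK" and old_state == "INSUFFICIENT_DATA":
            # Not a real recovery from ALARM; stay quiet.
            continue

        notify(message["AlarmName"], old_state, new_state, message["NewStateReason"])


def notify(alarm_name, old_state, new_state, reason):
    # Placeholder: send to email, Slack, etc.
    print(f"{alarm_name}: {old_state} -> {new_state} ({reason})")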
I asked for this in the AWS forums two years ago :-(
https://forums.aws.amazon.com/thread.jspa?threadID=153753&tstart=0
Unfortunately you cannot create notifications based on specific state changes (in your case you want a notification when state changes from ALARM to OK, but not when state changes from INSUFFICIENT to OK). I can only suggest that you also ask for it and hopefully it will eventually be added.
For metrics that are often in the INSUFFICIENT state I generally just create notifications for ALARMS and I don't have notifications on OK for these metrics - if I want to confirm that things are OK I use the AWS mobile app to check on things and see if they have resolved.

Appfabric Cache Perfmon Errors

We have a critical system that is highly dependent on Appfabric Caching. The setup we use is three nodes which serves around 2000 simultaneous connections and 150-200 requests/second.
Configuration is the default one. We receive maybe 5-10 "ErrorCode:SubStatus" errors each day, which is unacceptable.
I have added some performance counters, but I can't see anything unusual except that we sometimes see values on "Total Failure Exceptions / sec" and that "Total Failure Exceptions" increases, but only 2-3 times a day.
I would like to see where these errors come from, but I can't find them in any logs in the Event Viewer (I enabled them all according to the documentation). Does anyone know whether these errors could be logged somewhere and/or whether it is possible to see them in any other way?
We receive maybe 5-10 "ErrorCode:SubStatus" errors each day, which is unacceptable.
Between 5 and 10 errors per day, at 150 requests/sec? That's quite anecdotal. Your cache client always has to handle caching errors properly; a network failure can always occur.
"ErrorCode:SubStatus" on its own is quite obscure. There are more than 50 error codes in AppFabric Caching, so try to find out exactly which error codes you are getting. See the full list here.
I would like to see where these errors come from, but I can't find them in any logs in the Event Viewer (I enabled them all according to the documentation). Does anyone know whether these errors could be logged somewhere and/or whether it is possible to see them in any other way?
The only documentation available is here. The Event Viewer is useful for regularly monitoring the health of the cache cluster. However, when troubleshooting an error, it is possible to get an even more detailed log of the cache cluster's activities. I'm not sure this will help you much, because it is sometimes too specific.