AWS CloudWatch: Repeat Alert Notification after 24h in Alert State - amazon-web-services

I created some AWS CloudWatch alarms that typically use a period of 1 hour / 1 datapoint. When an alarm fires, our service team is notified. During a normal workday someone takes care of it and does the work of resetting some programs, etc. But it also happens that nobody has the time (or notices), and the alarm just stays in the ALARM state.
Now I want to repeat the notification if there hasn't been any state change in the past 24 hours. Is that possible? I haven't found an easy answer so far.
Thanks!
EDIT:
Added a "daily_occurence_alert" which is controlled by eventrules / time control. An additional alert for each observed Alert combined with an AND serves good.
It is a workaround, not a solution. Hope this feature will be added as a standard in future.
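For a more general version of this workaround, a scheduled Lambda (fired, say, once a day by an EventBridge rule) can list every alarm that has been sitting in ALARM for more than 24 hours and re-publish to the notification topic. A minimal boto3 sketch; the ALERT_TOPIC_ARN environment variable is a placeholder for whatever topic your service team subscribes to:

    import os
    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")
    sns = boto3.client("sns")

    # Placeholder: the SNS topic your service team already subscribes to.
    ALERT_TOPIC_ARN = os.environ["ALERT_TOPIC_ARN"]
    STALE_AFTER = timedelta(hours=24)

    def lambda_handler(event, context):
        cutoff = datetime.now(timezone.utc) - STALE_AFTER
        paginator = cloudwatch.get_paginator("describe_alarms")

        # Walk every alarm that is currently in the ALARM state.
        for page in paginator.paginate(StateValue="ALARM"):
            for alarm in page["MetricAlarms"]:
                # StateUpdatedTimestamp is the last state transition; if it is
                # older than 24 hours, nobody has reset the alarm since then.
                if alarm["StateUpdatedTimestamp"] <= cutoff:
                    sns.publish(
                        TopicArn=ALERT_TOPIC_ARN,
                        Subject=f"Still in ALARM: {alarm['AlarmName']}"[:100],
                        Message=(
                            f"Alarm {alarm['AlarmName']} has been in ALARM since "
                            f"{alarm['StateUpdatedTimestamp']:%Y-%m-%d %H:%M} UTC.\n"
                            f"Reason: {alarm['StateReason']}"
                        ),
                    )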

Related

Reduce alert noise in GCP stackdriver

We have set up alerts in our GCP environments. Basically, GCP Stackdriver raises alerts based on certain parameters we configured (both at the infrastructure level and the application level).
The issue is that we get too many alerts if the problem is not resolved quickly enough. For example, if a Compute Engine instance is down and we are still investigating, we keep getting alerts. I'm looking for some help to reduce the alert noise so that once we acknowledge an issue, the alert frequency is reduced until we resolve it (maybe once every three hours rather than one mail every 10 minutes, or until the problem is fixed).
Posting this as an answer for better usability.
When the alert is triggered you will receive notifications every 10 minutes or so until you acknowledge the incident.
Once you do, notifications will stop coming, but the incident will be kept open until you close it.
You can also silence the incident; however, doing so may also close other incidents that were triggered by the same condition as this one.
You may also have a look at the alerting behavior docs since they may prove useful in such cases.
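If you want to do the silencing outside the console, one option - an addition on my part, not something the answer above covers - is the Cloud Monitoring Snooze API, which mutes notifications for a policy for a fixed window. A rough REST sketch using application default credentials; the project ID, policy ID, display name, and 3-hour window are all placeholders, and the exact Snooze fields should be checked against the API reference:

    from datetime import datetime, timedelta, timezone

    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    # Placeholders: your project and the ID of the noisy alerting policy.
    PROJECT_ID = "my-project"
    POLICY_ID = "1234567890123456789"

    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/monitoring"]
    )
    session = AuthorizedSession(credentials)

    start = datetime.now(timezone.utc)
    end = start + timedelta(hours=3)

    # Mute notifications from this policy for the next 3 hours.
    body = {
        "displayName": "Acknowledged - investigating",
        "criteria": {
            "policies": [f"projects/{PROJECT_ID}/alertPolicies/{POLICY_ID}"]
        },
        "interval": {"startTime": start.isoformat(), "endTime": end.isoformat()},
    }

    resp = session.post(
        f"https://monitoring.googleapis.com/v3/projects/{PROJECT_ID}/snoozes",
        json=body,
    )
    resp.raise_for_status()
    print(resp.json())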

Alert Policies keep incidents open after the uptime checks are fixed

Since 9 Jun 2021 I have been experiencing a problem with GCP Alert Policies: after the uptime checks return to OK status, the alert policy stays triggered as active.
The alerts were configured a while ago and the uptime checks all appear green, but I have had 7 incidents open since that date.
Is anybody else experiencing the same problem?
Your case seems to be as described below; if so, it is expected behavior, and it causes your incidents to be closed only after 7 days.
Partial metric data: Missing or delayed metric data can result in policies not alerting and incidents not closing. Delays in data arriving from third-party cloud providers can be as high as 30 minutes, with 5-15 minute delays being the most common. A lengthy delay—longer than the duration window—can cause conditions to enter an "unknown" state. When the data finally arrives, Cloud Monitoring might have lost some of the recent history of the conditions. Later inspection of the time-series data might not reveal this problem because there is no evidence of delays once the data arrives
It sometimes happens that, in the event of an outage longer than 30 minutes (as mentioned above), your alerting policy enters that "unknown" state: metric reporting drops out completely (disappears), and Monitoring loses track of the condition's history. Once recovered, by default the policy keeps the last readable value; in this case, because metric reporting had stopped entirely, the tool treats it as a null value, which is translated to 0.000.
This unknown-state behavior causes the tool to "observe no changes" even though the metrics are reporting again at a normal state and pace, so the "7 days with no observed change" rule kicks in, which you can read about under Managing incidents: incidents will be automatically closed if the system observes that the condition stopped being met, or if 7 days pass without an observation that the condition continues to be met.
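One knob that is relevant here - not mentioned in the answer above, so treat it as an assumption about your setup - is the policy's alertStrategy.autoClose duration, which controls how long an incident stays open without data confirming the condition (the 7 days quoted above is the default). A rough REST sketch that shortens it; project and policy IDs are placeholders:

    import google.auth
    from google.auth.transport.requests import AuthorizedSession

    # Placeholders: your project and the ID of the affected alerting policy.
    PROJECT_ID = "my-project"
    POLICY_ID = "1234567890123456789"
    POLICY_NAME = f"projects/{PROJECT_ID}/alertPolicies/{POLICY_ID}"

    credentials, _ = google.auth.default(
        scopes=["https://www.googleapis.com/auth/monitoring"]
    )
    session = AuthorizedSession(credentials)

    # Close open incidents after 30 minutes without data confirming the
    # condition, instead of waiting for the 7-day default.
    body = {"alertStrategy": {"autoClose": "1800s"}}

    resp = session.patch(
        f"https://monitoring.googleapis.com/v3/{POLICY_NAME}",
        params={"updateMask": "alertStrategy"},
        json=body,
    )
    resp.raise_for_status()
    print(resp.json()["alertStrategy"])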

GCP Alert Filters Don't Affect Open Incidents

I have an alert that I have configured to send email when the sum of executions of cloud functions that have finished in a status other than 'error' or 'ok' is above 0 (grouped by the function name).
The way I defined the alert is:
And the secondary aggregator is delta.
The problem is that once the alert is open, it looks like the filters don't matter any more: the alert stays open because it sees that the cloud function is triggered and finishes with any status (even an 'ok' status keeps it open, as long as it is triggered often enough).
At the moment the only solution I can think of is to define a log-based metric that does the counting itself, and then base the alert on that custom metric instead of on the built-in one.
Is there something that I'm missing?
Edit:
Adding another image to show what I think might be the problem:
From the image above we see that the graph won't go down to 0 but stays at 1, which is not the way other, normal incidents behave.
According to the official documentation:
"Monitoring automatically closes an incident when it observes that the condition is no longer met or when 7 days have passed without an observation that the condition is still being met."
That made me think that there are times when the condition itself is not what decides whether the incident closes, which is confirmed here:
"If measurements are missing (for example, if there are no HTTP requests for a couple of minutes), the policy uses the last recorded value to evaluate conditions."
The lack of HTTP requests is not a reason to close the incident, because the policy keeps using the last recorded value (the one that triggered the alert).
So using alerts for HTTP requests is fine, but you need to close them yourself. That said, I think it would be better to use a custom metric instead if you want them to close automatically.
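As a concrete sketch of the custom-metric route suggested above: a log-based metric can do the counting from the functions' logs, and the alert can then target that metric instead of the built-in one. The metric name and log filter below are illustrative placeholders that would need to be adapted to what your functions actually log:

    from google.cloud import logging as gcp_logging

    client = gcp_logging.Client()

    # Illustrative filter: count Cloud Functions log entries that report an
    # error-level finish. Adjust it to whatever your functions actually log.
    LOG_FILTER = 'resource.type="cloud_function" AND severity>=ERROR'

    metric = client.metric(
        "bad_function_executions",  # placeholder metric name
        filter_=LOG_FILTER,
        description="Cloud Functions executions that did not finish with status ok",
    )

    if not metric.exists():
        metric.create()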

Google Cloud Monitoring: Add an alert if Publish succeeds and subscribe fails

I want to add an alert in Google Cloud Monitoring such that, for a given topic and subscription, I know when the topic is being published to but the subscription's messages are not being acknowledged at the same (or a similar) rate over a given time frame.
How do we achieve that using Alerts in Google Cloud Monitoring or StackDriver?
I have tried an approach where I have 2 conditions to satisfy:
If publish operations > 0.016/sec for 2 minutes (meaning at least one publish per minute)
If subscribe acknowledgments < 0.001/sec for 2 minutes (i.e. no subscribe acknowledgments happening in 2 minutes)
Then, alert.
What's happening here is that, during low load, if there are no publishes for, say, a span of 3 minutes and then a publish happens, both conditions 1 and 2 become true and devs are alerted about a failure.
So, what is the right way of designing such alerts?
If my approach is close to what I want, the next questions that come to my mind are:
Is there a way to start counting the two minutes from the instant the publish happens, to check whether the acknowledgement condition is satisfied or not?
Or, is there a way to make the alert wait for 2-3 minutes to see if the incident resolves, and only then send an alert to the devs?
Or, is there a way to count the occurrences of these conditions being satisfied and then alert only if there are more than 5 or 10 occurrences in a span of 15 minutes or so?
Sorry for the long post. But, any kind of help is appreciated.
To calculate the frequency of tasks, a time window of 2-3 minutes is used, so if you had 0 tasks for 2 minutes or longer, this issue recurs. This is described in the documentation about partial metrics, and that link also lists some workarounds.
You can try creating your own custom metrics.
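To make the custom-metric suggestion concrete, here is a rough sketch of a small job (Cloud Function, cron, whatever fits) that writes a custom gauge metric you can alert on with a single condition. How the "published but not acknowledged" value is derived is left to the caller, and the metric name is a placeholder:

    import time

    from google.cloud import monitoring_v3

    PROJECT_ID = "my-project"  # placeholder

    def write_unacked_indicator(value: int) -> None:
        """Write 1 if messages were published but none were acknowledged in the
        window you measured, else 0. Deriving `value` (for example from the
        Pub/Sub publish and ack metrics) is left to the calling job."""
        client = monitoring_v3.MetricServiceClient()

        series = monitoring_v3.TimeSeries()
        series.metric.type = "custom.googleapis.com/pubsub/published_but_unacked"
        series.resource.type = "global"

        now = time.time()
        interval = monitoring_v3.TimeInterval(
            {"end_time": {"seconds": int(now), "nanos": int((now % 1) * 1e9)}}
        )
        point = monitoring_v3.Point(
            {"interval": interval, "value": {"int64_value": value}}
        )
        series.points = [point]

        client.create_time_series(
            name=f"projects/{PROJECT_ID}", time_series=[series]
        )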

Avoiding INSUFFICIENT DATA in Cloudwatch?

I have alarms set up to tell me when my load balancers are throwing 5xxs using the HTTPCode_Backend_5XX metric with the sum statistic. The issue is that sum registers 0 as no data points, so when no 5xxs are thrown, the alarm is treated as insufficient data. This is especially frustrating, because I have SNS setup to notify me whenever we get too many 5xxs (alarm state) and whenever things go back to normal. Annoyingly, 0 5xxs means we're in INSUFFICIENT DATA status, but 1 5xx means we're in OK status, so 1 5xx triggers everyone getting notified that stuff is OK. Is there any way around this? Ideally, I'd like to just have 0 of anything show up as a zero data point instead of no data at all (insufficient data).
As of March 2017, you can treat missing data as acceptable. This will prevent the alarm from being marked as INSUFFICIENT.
You can also set this in CloudFormation using the TreatMissingData property. For example:
TreatMissingData: notBreaching
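The same setting is available when creating the alarm through the API; here is a boto3 sketch for the HTTPCode_Backend_5XX case (alarm name, load balancer name, threshold, and topic ARN are placeholders):

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="backend-5xx",  # placeholder
        Namespace="AWS/ELB",
        MetricName="HTTPCode_Backend_5XX",
        Dimensions=[{"Name": "LoadBalancerName", "Value": "my-load-balancer"}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=10,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        TreatMissingData="notBreaching",  # no data (no 5xxs) counts as OK
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
        OKActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
    )
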
We had a similar issue for some of our alarms. You can actually avoid this behaviour with some work, if you really want to deal with the overhead.
What we did, instead of sending the SNS notifications directly to e-mail addresses, was to create a Lambda function and trigger it whenever a notification lands in the SNS topic.
This way you have much more control over the actions you take once an alarm is triggered, since the notification payload also gives you the old state value.
The good news is, there is already a lambda template to get started.
https://aws.amazon.com/blogs/aws/new-slack-integration-blueprints-for-aws-lambda/
Just pick the one that is designed to send CloudWatch alarms to Slack. You can then modify the code as you wish: either drop the Slack part and just use e-mail, or keep Slack (which is what we did, and it works like a charm).
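A rough sketch of such a handler: the CloudWatch alarm notification arrives as JSON in the SNS message body, including OldStateValue and NewStateValue, so transitions from INSUFFICIENT_DATA to OK can simply be dropped (the forwarding at the end is just a placeholder print):

    import json

    def lambda_handler(event, context):
        for record in event["Records"]:
            # The CloudWatch alarm notification is JSON inside the SNS message.
            alarm = json.loads(record["Sns"]["Message"])

            old_state = alarm["OldStateValue"]
            new_state = alarm["NewStateValue"]

            # Drop "recoveries" from INSUFFICIENT_DATA: nobody was paged for those.
            if old_state == "INSUFFICIENT_DATA" and new_state == "OK":
                continue

            # Placeholder: forward to e-mail, Slack, etc.
            print(
                f"{alarm['AlarmName']}: {old_state} -> {new_state} "
                f"({alarm['NewStateReason']})"
            )
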
I asked for this in the AWS forums two years ago :-(
https://forums.aws.amazon.com/thread.jspa?threadID=153753&tstart=0
Unfortunately you cannot create notifications based on specific state changes (in your case you want a notification when the state changes from ALARM to OK, but not when it changes from INSUFFICIENT_DATA to OK). I can only suggest that you also ask for it and hope it will eventually be added.
For metrics that are often in the INSUFFICIENT_DATA state I generally just create notifications for the ALARM state and skip the OK notifications for those metrics - if I want to confirm that things are OK I use the AWS mobile app to check on them and see whether they have resolved.