Alert Policies keep incidents open after the uptime checks are fixed - google-cloud-platform

Since 9 Jun 2021 I have been experiencing a problem with GCP Alert Policies: after the uptime checks recover to OK status, the Alert Policy stays triggered as active.
The alerts were configured a while ago and the Uptime Checks all appear green, but I already have 7 incidents open since that date.
Is anybody else experiencing the same problem?

Your case seems to be as described below; if so, this is expected behavior that causes your incidents to be closed after 7 days.
Partial metric data: Missing or delayed metric data can result in policies not alerting and incidents not closing. Delays in data arriving from third-party cloud providers can be as high as 30 minutes, with 5-15 minute delays being the most common. A lengthy delay—longer than the duration window—can cause conditions to enter an "unknown" state. When the data finally arrives, Cloud Monitoring might have lost some of the recent history of the conditions. Later inspection of the time-series data might not reveal this problem because there is no evidence of delays once the data arrives.
It sometimes happens that, in the event of an outage longer than 30 minutes (as mentioned above), your alerting policy enters that “unknown” state, the metric reporting drops off completely (disappears), and Monitoring loses track of the condition's history. Once recovered, by default it keeps the last readable value; in this case, since the metric reporting stopped entirely, the tool treats it as a null value, which is translated to 0.000.
This unknown-state behavior causes the tool to “observe no changes” even though the metrics are reporting back at a normal state and pace, which triggers the “7 days with no observable change” rule you can read about in Managing incidents: incidents will be automatically closed if the system observed that the condition stopped being met, or if 7 days passed without an observation that the condition continued to be met.

Related

Reduce alert noise in GCP Stackdriver

We have set up alerts in our GCP environments. Basically, GCP Stackdriver raises alerts based on certain parameters we configured (both at the infrastructure level and the application level).
The issue is that we get too many alerts if the problem is not resolved quickly enough. For example, if a Compute Engine instance is down and we are still investigating, we keep getting alerts. I'm looking for help to reduce alert noise so that once we acknowledge an issue, the alert frequency is reduced until we resolve it (maybe once every three hours rather than one mail every 10 minutes), or the alerts stop once the problem is fixed.
Posting this as an answer for better usability.
When the alert is triggered you will receive notifications every 10 minutes or so until you acknowledge the incident.
When you do, notifications will stop coming, but the incident will be kept open until you close it.
You can also silence the incident; however, this may also close other incidents that were triggered by the same condition as this one.
You may also have a look at the alerting behavior docs since they may prove useful in such cases.

AWS CloudWatch: Repeat Alert Notification after 24h in Alert State

I created some AWS CloudWatch alarms which typically have a period of 1 hour / 1 datapoint. When an alarm occurs, our service team gets notified. During a "normal" workday, someone takes care of it and does the work of resetting some programs etc. But it also happens that no one has the time (or notices), and the alarm stays in the alarm state.
Now I want to repeat the notification if there was no state change in the past 24 hours. Is that possible? I still haven't found the "easy" answer.
Thanks!
EDIT:
Added a "daily_occurence_alert" which is controlled by eventrules / time control. An additional alert for each observed Alert combined with an AND serves good.
It is a workaround, not a solution. Hope this feature will be added as a standard in future.
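For reference, a minimal boto3 sketch of the composite-alarm half of this workaround, assuming the scheduled "daily_occurence_alert" and the observed alarm already exist; the alarm names and SNS topic ARN are hypothetical:

    # Minimal sketch: a composite alarm that ANDs the scheduled daily alarm with
    # an observed alarm, so it fires (and re-notifies) only if the observed alarm
    # is still in ALARM when the daily alarm triggers. Names/ARNs are hypothetical.
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_composite_alarm(
        AlarmName="my-service-alarm-daily-reminder",
        AlarmRule='ALARM("daily_occurence_alert") AND ALARM("my-service-alarm")',
        ActionsEnabled=True,
        AlarmActions=["arn:aws:sns:eu-central-1:123456789012:service-team-topic"],
        AlarmDescription="Reminds the service team when my-service-alarm has been "
                         "in ALARM long enough for the daily alarm to fire again.",
    )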

GCP Alert Filters Don't Affect Open Incidents

I have an alert that I have configured to send an email when the sum of executions of cloud functions that finished with a status other than 'error' or 'ok' is above 0 (grouped by function name).
The way I defined the alert is:
And the secondary aggregator is delta.
The problem is that once the alert is open, it looks like the filters don't matter anymore, and the alert stays open because it sees that the cloud function is triggered and finishes with any status (even 'ok' status keeps it open as long as it's triggered often enough).
At the moment, the only solution I can think of is to define a log-based metric that does the counting itself, and then base the alert on that custom metric instead of the built-in one.
Is there something that I'm missing?
Edit:
Adding another image to show what I think might be the problem:
From the image above we see that the graph won't go down to 0 but stays at 1, which is not how other, normal incidents behave.
According to the official documentation:
"Monitoring automatically closes an incident when it observes that the condition is no longer met or when 7 days have passed without an observation that the condition is still being met."
That made me think that there are cases where the condition is no longer evaluated in a way that would close the incident, which is confirmed here:
"If measurements are missing (for example, if there are no HTTP requests for a couple of minutes), the policy uses the last recorded value to evaluate conditions."
The lack of HTTP requests isn't a reason to close the incident, since the policy keeps using the last recorded value (the one that triggered the alert).
So using alerts on HTTP requests is fine, but you need to close them yourself. That said, I think it would be better to use a custom metric if you want the incidents to close automatically; a rough sketch of that approach follows.
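As an illustration of the custom log-based metric idea, a minimal sketch using the google-cloud-logging client library; the metric name and the log filter are hypothetical and would need to be adapted to the statuses you actually want to count:

    # Minimal sketch: create a log-based metric that counts Cloud Functions
    # executions finishing with a status other than 'ok' or 'error', which an
    # alerting policy can then use instead of the built-in execution metric.
    # The metric name and log filter below are illustrative, not a known-good filter.
    from google.cloud import logging

    client = logging.Client()

    metric = client.metric(
        "non_ok_function_finishes",  # hypothetical metric name
        filter_=(
            'resource.type="cloud_function" '
            'AND textPayload:"finished with status" '
            'AND NOT textPayload:"status: ok" '
            'AND NOT textPayload:"status: error"'
        ),
        description="Counts function executions that finished with an unexpected status.",
    )

    if not metric.exists():
        metric.create()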

AWS Lambda function that has been working for weeks, one day timed out for no apparent reason. Ideas?

I wrote a simple lambda function (in python 3.7) that runs once a day, which keeps my Glue data catalog updated when new partitions are created. It works like this:
Object creation in a specific S3 location triggers the function asynchronously
From the event, lambda extracts the key (e.g.: s3://my-bucket/path/to/object/)
Through the AWS SDK, Lambda asks Glue if the partition already exists
If not, it creates the new partition. If yes, it terminates the process.
Also, the function has 3 print statements:
one at the very beginning, saying it started the execution
one in the middle, which says if the partition exists or not
one at the end, upon successful execution.
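For context, a minimal sketch of what such a handler might look like; the database and table names and the way the partition value is derived from the key are hypothetical, not the actual code in question:

    # Minimal sketch of the described flow: S3 event -> check Glue for the
    # partition -> create it if missing. Names and partition layout are hypothetical.
    import boto3

    glue = boto3.client("glue")

    DATABASE = "my_database"   # hypothetical
    TABLE = "my_table"         # hypothetical


    def handler(event, context):
        print("Execution started")

        # Extract the object key from the S3 event, e.g. "path/to/dt=2021-06-09/file"
        key = event["Records"][0]["s3"]["object"]["key"]
        partition_value = key.split("/")[-2].split("=")[-1]  # hypothetical key layout

        try:
            glue.get_partition(
                DatabaseName=DATABASE,
                TableName=TABLE,
                PartitionValues=[partition_value],
            )
            print("Partition already exists")
        except glue.exceptions.EntityNotFoundException:
            print("Partition does not exist, creating it")
            table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
            storage = dict(table["StorageDescriptor"])
            storage["Location"] = f"{storage['Location']}/dt={partition_value}"
            glue.create_partition(
                DatabaseName=DATABASE,
                TableName=TABLE,
                PartitionInput={
                    "Values": [partition_value],
                    "StorageDescriptor": storage,
                },
            )

        print("Execution finished successfully")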
This function has an average execution time of 460ms per invocation, with 128MB RAM allocated, and it cannot have more than about 12 concurrent executions (as 12 is the maximum amount of new partitions that can be generated daily). There are no other lambda functions running at the same time that may steal concurrency capacity. Also, just to be sure, I have set the timeout limit to be 10 seconds.
It has been working flawlessly for weeks, except this morning, 2 of the executions timed out after reaching the 10 seconds limit, which is very odd given it's 20 times larger than the avg. duration.
What surprises me the most, is that in one case only the 1st print statement got logged in CloudWatch, and in the other case, not even that one, as if the function got called but never actually started the process.
I could not figure out what may have caused this. Any idea or suggestion is much appreciated.
Maybe AWS had a problem with their services; I got the same issue.
Not sure if it helps, but you can check at:
https://status.aws.amazon.com
[CloudFront High Error Rate]
4:28 PM PDT We are investigating elevated error rates and elevated latency in multiple edge locations.
5:08 PM PDT We can confirm elevated error rates and high latency accessing content from multiple Edge Locations, which is also contributing to longer than usual propagation times for changes to CloudFront configurations. We have identified the root cause and continue to work toward resolution.
5:54 PM PDT We are beginning to see recovery for the elevated error rates and high latency accessing content from multiple Edge Locations. Error rates have recovered for all locations except for Europe. Additionally, we continue to work toward recovery for the increased delays in propagating configuration changes to Cloudfront Distributions.
6:21 PM PDT Starting 3:18 PM PDT, we experienced elevated error rates and high latency accessing content from multiple Edge Locations. The elevated error rates and elevated latency accessing content were fully recovered at 5:48 PM PDT. During this time, customers may also have experienced longer than usual change propagation delays for CloudFront configurations and invalidations. The backlog of CloudFront configuration changes and invalidations were fully processed by 6:14 PM PDT. All issues have been fully resolved and the system is operating normally.

Avoiding INSUFFICIENT_DATA in CloudWatch?

I have alarms set up to tell me when my load balancers are throwing 5xxs, using the HTTPCode_Backend_5XX metric with the sum statistic. The issue is that sum registers 0 as no data points, so when no 5xxs are thrown, the alarm is treated as insufficient data. This is especially frustrating because I have SNS set up to notify me whenever we get too many 5xxs (alarm state) and whenever things go back to normal. Annoyingly, 0 5xxs means we're in INSUFFICIENT_DATA status, but 1 5xx means we're in OK status, so 1 5xx triggers everyone getting notified that stuff is OK. Is there any way around this? Ideally, I'd just like 0 of anything to show up as a zero data point instead of no data at all (insufficient data).
As of March 2017, you can treat missing data as acceptable. This will prevent the alarm from being marked as INSUFFICIENT.
You can also set this in CloudFormation using the TreatMissingData property. For example:
TreatMissingData: notBreaching
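If you create alarms through the API instead of CloudFormation, the same option is available on put_metric_alarm; a minimal boto3 sketch, in which the alarm name, load balancer name, threshold, and SNS topic ARN are hypothetical:

    # Minimal sketch: a backend-5xx alarm that treats missing data as not breaching,
    # so "no 5xxs at all" evaluates to OK instead of INSUFFICIENT_DATA.
    # Alarm name, load balancer name, threshold, and topic ARN are hypothetical.
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="backend-5xx-too-high",
        Namespace="AWS/ELB",
        MetricName="HTTPCode_Backend_5XX",
        Dimensions=[{"Name": "LoadBalancerName", "Value": "my-load-balancer"}],
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=10,
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
        OKActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
    )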
We had a similar issue for some of our alarms. You can actually avoid this behaviour with some work, if you really want to deal with the overhead.
What we did, instead of sending SNS notifications directly to email, was to create a Lambda function and trigger it whenever a notification arrives in the SNS topic.
This way you have more control over the actions to take once an alarm is triggered, since the notification message also gives you the old state value.
The good news is, there is already a lambda template to get started.
https://aws.amazon.com/blogs/aws/new-slack-integration-blueprints-for-aws-lambda/
Just pick the one that is designed to send CloudWatch alarms to Slack. You can then modify the code as you wish: either drop the Slack part and just use emails, or keep it with Slack (which is what we did, and it works like a charm). A rough sketch of that kind of handler follows.
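As a rough sketch of the idea (not the blueprint itself), here is a Lambda handler that parses the CloudWatch alarm notification delivered via SNS and only acts on the state transitions you care about; the downstream notify step is a stub you would replace with email, Slack, or anything else:

    # Minimal sketch: inspect the old/new alarm state from the SNS-delivered
    # CloudWatch notification and forward only meaningful transitions.
    import json


    def handler(event, context):
        for record in event["Records"]:
            message = json.loads(record["Sns"]["Message"])

            alarm = message["AlarmName"]
            old_state = message["OldStateValue"]   # e.g. "INSUFFICIENT_DATA"
            new_state = message["NewStateValue"]   # e.g. "OK" or "ALARM"

            # Skip the noisy INSUFFICIENT_DATA -> OK transition; notify only on
            # real problems and on recoveries from a real problem.
            if new_state == "ALARM" or (new_state == "OK" and old_state == "ALARM"):
                notify(f"{alarm}: {old_state} -> {new_state}: {message['NewStateReason']}")


    def notify(text):
        # Stub: replace with email, Slack, or another integration (hypothetical).
        print(text)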
I asked for this in the AWS forums two years ago :-(
https://forums.aws.amazon.com/thread.jspa?threadID=153753&tstart=0
Unfortunately you cannot create notifications based on specific state changes (in your case you want a notification when state changes from ALARM to OK, but not when state changes from INSUFFICIENT to OK). I can only suggest that you also ask for it and hopefully it will eventually be added.
For metrics that are often in the INSUFFICIENT state I generally just create notifications for ALARMS and I don't have notifications on OK for these metrics - if I want to confirm that things are OK I use the AWS mobile app to check on things and see if they have resolved.