I have a service and I want to know how many errors it throws.
So I've created a metric and an alert based on that metric.
The metric is a counter, and it filters out all the unneeded logs, leaving only the relevant ones.
The alert uses the metric with an aggregator of type 'count' and an aligner of type 'delta', resulting in a value of '1' whenever the metric catches any errors.
The condition for the alert is to check if the most recent value is above 0.99.
After an incident from that alert has fired, it just won't close.
I went to the summary page and it shows that, for some reason, the condition is still being met (at least that is what I understand from the red lines that keep increasing), even though the errors were last thrown a few hours ago.
In the picture you can see the red lines, which indicate the duration of the incident, and below them, in the graph, you can see three small points where an error was detected. The first one caused the incident to fire.
Any help on how to get the incident to resolve?
Thanks!
I was able to fix the problem as soon as I set the aggregator to 'sum' instead of 'count'.
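For reference, the fixed condition corresponds roughly to the following sketch using the google-cloud-monitoring Python client (the project ID, metric name, and 5-minute alignment period are placeholders, not my exact setup):

```python
from google.cloud import monitoring_v3

client = monitoring_v3.AlertPolicyServiceClient()
project_name = "projects/my-project-id"  # placeholder project

aggregation = monitoring_v3.Aggregation(
    alignment_period={"seconds": 300},  # placeholder alignment period
    per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_DELTA,
    # The fix from above: reduce with 'sum' instead of 'count'.
    cross_series_reducer=monitoring_v3.Aggregation.Reducer.REDUCE_SUM,
)

condition = monitoring_v3.AlertPolicy.Condition(
    display_name="error metric above 0",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        # Placeholder log-based metric name
        filter='metric.type="logging.googleapis.com/user/my-error-metric"',
        aggregations=[aggregation],
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0.99,  # fire when the most recent value is above 0.99
        duration={"seconds": 0},
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Service error alert",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.OR,
    conditions=[condition],
)
client.create_alert_policy(name=project_name, alert_policy=policy)
```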
Related
I need help with my monthly report SAS code below:
Firstly, the code takes too long to run even though the data is relatively small. When it completes, a message appears that reads: "The contents of log is too large."
Please can you check what the issue is with my code?
Meaning of macro variables:
&end_date. = the last day of the previous month, for instance 30-Apr-22.
&lastest_refrsh_dt. = the latest date the report was published.
Once the report is published, we update the config table with &end_date.
work.schedule_dt: a table that contains the update flags. If all flags are true, we proceed, but if the update flags are false, we exit. On the sixth day of the month, if the flag is still false, an email that reads "data not available" is sent.
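To make the intended flow explicit, here is a rough Python sketch of that decision logic (the actual code is a SAS macro; the function and flag names here are just placeholders for illustration):

```python
from datetime import date

def decide_action(update_flags: list[bool], today: date) -> str:
    """Illustrative only: mirrors the scheduling rules described above."""
    if all(update_flags):
        return "run_report"   # all flags true -> build the monthly report
    if today.day >= 6:
        return "send_email"   # still false on the 6th -> "data not available" email
    return "exit"             # otherwise exit and try again on a later day

# Example: flags not yet all true on the 6th of the month
print(decide_action([True, False], date(2022, 5, 6)))  # -> "send_email"
```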
Normally, that message about the log is due to warnings in the log over type issues. From what you describe, it is likely due to an issue with date interpretation.
There is nothing in this post to help beyond that. You need to open the log and find out what the message is. Otherwise, it is speculation on our part.
TL;DR;
What is the proper way to create a metric so that it generates reliable information matching the Logs Insights results?
What is desired
The current Logs Insights results look similar to the following:
However, it is easier to analyse these logs using metrics (mostly because you can have multiple sources of data in the same plot and even perform math operations between them).
Solution according to docs
Allegedly, logs can be converted to a metric with a metric filter by following a guide like this. However, this approach does not seem to work entirely right (I guess because of the time frames that have to be imposed on the metric plots), providing incorrect information, for example:
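For context, the filter itself was created roughly like this with boto3 (the log group, pattern, namespace, and metric name are placeholders rather than my actual values):

```python
import boto3

logs = boto3.client("logs")

logs.put_metric_filter(
    logGroupName="/aws/lambda/my-function",      # placeholder log group
    filterName="my-event-filter",
    filterPattern='"SomeInterestingEvent"',      # matches log lines containing this text
    metricTransformations=[
        {
            "metricName": "MyEventCount",
            "metricNamespace": "Custom/MyApp",
            "metricValue": "1",                  # emit 1 per matching log event
            "defaultValue": 0.0,                 # emit 0 when nothing matches
        }
    ],
)
```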
Issue with solution
In the previous image I've created a dashboard containing the metric count (the number 7), corresponding to the sum of events in each 5-minute window. I've also added a preview of the Logs Insights query corresponding to the information used to create the events.
However, as can be seen, the number of logs is 4, but the event count displays 7. Changing the time frame of the metric generates other kinds of issues (e.g., selecting a very small time frame like 1 second won't retrieve any data, and a slightly smaller time frame will provide yet another wrong number: 3, when there are 4 logs, for example).
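The same numbers can be checked outside the dashboard with GetMetricData, which makes it clear that what is displayed depends on the chosen period and statistic (the namespace and metric name below are the same placeholders as above):

```python
import datetime as dt
import boto3

cloudwatch = boto3.client("cloudwatch")

end = dt.datetime.now(dt.timezone.utc)
start = end - dt.timedelta(hours=1)

resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "event_count",
            "MetricStat": {
                "Metric": {
                    "Namespace": "Custom/MyApp",
                    "MetricName": "MyEventCount",
                },
                "Period": 300,  # 5-minute buckets, as in the dashboard widget
                "Stat": "Sum",  # sum of the emitted 1s = matching events per bucket
            },
        }
    ],
    StartTime=start,
    EndTime=end,
)
print(resp["MetricDataResults"][0]["Values"])
```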
P.S.
I've also tried converting the Logs Insights results to metrics using this Lambda function, as suggested by Danil Smirnov, to no avail, as it seems to produce the same issues.
Sorry for the large image, but it's the best way to convey what I'm struggling to understand.
This is a simple alarm that should trigger when a Lambda function generates 10 or more errors within 1 hour. It should be very simple, basic stuff.
So why does this alarm go into the ALARM state when the metric doesn't cross the threshold, as shown in the image (green boxes)? The new(?) bar at the bottom is the state of the alarm.
All relevant settings should be in the screenshot; it's just "sum of errors over 1 hour".
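In API terms, the alarm is roughly the following boto3 sketch (the function and alarm names are made up, but the statistic, period, and threshold match what is described above):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="my-function-errors",  # placeholder name
    Namespace="AWS/Lambda",
    MetricName="Errors",
    Dimensions=[{"Name": "FunctionName", "Value": "my-function"}],  # placeholder function
    Statistic="Sum",
    Period=3600,            # one 1-hour period
    EvaluationPeriods=1,
    DatapointsToAlarm=1,
    Threshold=10,
    ComparisonOperator="GreaterThanOrEqualToThreshold",  # 10 or more errors
)
```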
I could just adjust the threshold to account for this weirdness, but I'm guessing this is not an AWS error, rather a failure of understanding on my part. I want to understand.
I did find this text hidden under an information icon: "The alarming datapoints can appear different to the metric line because of aggregation when displaying at a higher period or because of delayed data that arrived after the alarm was evaluated."
I guess that is the answer, but I still don't like that the visualization doesn't use the same logic as the alarm. It kind of defeats the purpose.
I am going through the documentation of CloudWatch alarms https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html. There are example scenario tables in the Configuring How CloudWatch Alarms Treat Missing Data section. I am having difficulty understanding what is going on.
In the last two rows, why is the behaviour of the Missing and Ignore columns different?
First of all, the last two rows are still very different. Although they both have two missing data points, the very last row's most recent data point is an 'X', which is breaching/bad, while the second-to-last row's most recent data point is an 'O', which is OK/good. Under the setting of treating missing data as "MISSING" or "IGNORE", the second-to-last row is considered OK, even though it is missing two data points. It is reasonable that the MISSING/IGNORE settings are more permissive than BREACHING.
In the last row, MISSING and IGNORE also behave differently. This is because IGNORE is more permissive than MISSING: as you can see, IGNORE will "Retain current state". This means your alarm will, under that circumstance, simply stay as it is until new data points come in and break the current data point pattern.
The rationale behind the behavior of MISSING in the last row is that, although we see a single bad data point, we need more data points to determine whether the next alarm state should be good or bad, or INSUFFICIENT_DATA if no more data points arrive.
From the docs:
No matter what value you set for how to treat missing data, when an alarm evaluates whether to change state, CloudWatch attempts to retrieve a higher number of data points than specified by Evaluation Periods. The exact number of data points it attempts to retrieve depends on the length of the alarm period and whether it is based on a metric with standard resolution or high resolution. The timeframe of the data points that it attempts to retrieve is the evaluation range.
The docs go on to give an example of an alarm with 'EvaluationPeriods' and 'DatapointsToAlarm' set to 3. They state that CloudWatch chooses the 5 most recent data points. Part of my question is: where are they getting 5? It's not clear from the docs.
The second part of my question is: why have this behavior at all (or at least, why have it by default)? If I set my Evaluation Periods to 3, my Datapoints to Alarm to 3, and set 'TreatMissingData' to 'breaching', I'm going to expect 3 periods of missing data to trigger an alarm state. This doesn't necessarily happen, as illustrated by an example in the docs.
I had the same questions. As best I can tell, the 5 can be explained if I am thinking about standard collection intervals vs standard resolution correctly. In other words, if we assume a standard collection interval of 5 minutes and a standard 1-minute resolution, then within the 5 minutes of the collection interval, 5 separate data points are collected. The example states you need 3 data points over 3 evaluation periods, which is less than the 5 data points CloudWatch has collected. CloudWatch would then have all the data points it needs within the 5-data-point evaluation range defined by a single collection. As an example, if 4 of the 5 expected data points are missing from the collection, you have one defined data point and thus need 2 more within the evaluation range to reach the three needed for alarm evaluation. These 2 (not the 4 that are actually missing from the collection) are considered the "missing" data points in the documentation - I find this confusing. The tables in the AWS documentation provide examples for how the different treatments of the "missing" 2 data points impact the alarm evaluations.
Regardless of whether this is the correct interpretation, this could be better explained in the documentation.
I also agree that this behavior is unexpected, and the fact that you can't configure it is quite frustrating. However, there does seem to be an easy workaround depending on your use case.
I also wanted the same behavior as you specified, i.e., a missing data point is a breaching data point, plain and simple:
If I set my Evaluation Periods to 3, my Datapoints to Alarm to 3, and set 'TreatMissingData' to 'breaching', I'm going to expect 3 periods of missing data to trigger an alarm state.
I had a use case that is basically a push-style health monitor. We needed a particular on-premises service to report a "healthy" metric to CloudWatch daily, and an alarm in case this report didn't come through due to network issues or anything else disruptive. Semantically, missing data is the same as reporting a metric value of 0 (the "healthy" metric has value 1).
So I was able to use metric math's FILL function to replace every missing data point with 0. Setting a 1-out-of-1 alarm with a "< 1" threshold on this new expression provides exactly the needed behavior without involving any kind of "missing data".
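In boto3 terms, the workaround looks roughly like this (the namespace, metric name, statistic, and daily period are assumptions for illustration, not the exact values I used):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="healthy-heartbeat-missing",  # placeholder name
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {
                    "Namespace": "Custom/OnPrem",  # placeholder namespace
                    "MetricName": "Healthy",       # placeholder metric (value 1 when healthy)
                },
                "Period": 86400,                   # assumed daily reporting interval
                "Stat": "Maximum",                 # assumed statistic
            },
            "ReturnData": False,
        },
        {
            "Id": "e1",
            "Expression": "FILL(m1, 0)",  # replace every missing data point with 0
            "Label": "HealthyFilled",
            "ReturnData": True,           # alarm on the filled expression
        },
    ],
    EvaluationPeriods=1,   # 1 out of 1
    DatapointsToAlarm=1,
    Threshold=1,
    ComparisonOperator="LessThanThreshold",  # alarm when the filled value is < 1
)
```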