AWS alarm goes to OK state unexpectedly

I have an alarm set up in AWS CloudWatch which generates a data point every hour. When its value is greater than or equal to 1, the alarm goes into the ALARM state. Following are the settings:
On 2nd Nov, it went into the ALARM state and then back to the OK state in 3 hours. I'm just trying to understand why it took 3 hours to get back to the OK state instead of 1 hour, since the metric runs every hour.
Here are the logs, which show that the metric transitioned from the ALARM state to the OK state over 3 hours.
Following is the graph which shows the data point value every hour.

This is probably because alarms are evaluated over a window longer than your 1-hour period. That window is the evaluation range: the span covered by the number of data points the alarm examines on each evaluation. In your case, the evaluation range likely covers several hourly data points, so the alarm needs several consecutive non-breaching hours before it transitions back to OK.
There is also a thread about this behavior with extra info on the AWS forums: Unexplainable delay between Alarm data breach and Alarm state change.
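To make that concrete, here is a minimal sketch (the alarm, namespace, and metric names are hypothetical, not the asker's actual configuration) of an alarm whose evaluation range spans three hourly data points, using the AWS SDK for Java v2:

import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.*;

CloudWatchClient cw = CloudWatchClient.create();
cw.putMetricAlarm(PutMetricAlarmRequest.builder()
        .alarmName("hourly-metric-alarm")   // hypothetical name
        .namespace("MyApp")                 // hypothetical namespace
        .metricName("HourlyCount")          // hypothetical metric
        .statistic(Statistic.MAXIMUM)
        .period(3600)                       // one data point per hour
        .evaluationPeriods(3)               // evaluation range = last 3 data points
        .datapointsToAlarm(1)               // 1 breaching point out of 3 triggers ALARM
        .threshold(1.0)
        .comparisonOperator(ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD)
        .build());

With a 1-out-of-3 configuration like this, a single breaching hour raises the alarm, but the alarm only returns to OK once all three data points in the evaluation range are non-breaching, i.e. roughly 3 hours later, which matches the delay observed.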

Related

AWS CloudWatch alarm for infrequently executed process

I have a process that runs once every 24 hours (data pipeline). I want to monitor it for failures, but am having trouble defining an alarm that works adequately.
Since the process runs only once every 24 hours, there can be 1 failure every 24 hours at most.
If I define a short period (e.g. 5 minutes), then the alarm will flip back to OK status after 5 minutes, as there are no more errors.
If I define a period of 24 hours, then the alarm will be stuck in the ALARM state until the period passes, even if I re-run the process manually and it succeeds, because "one error within a 24 hour period" is still true.
How do I get an alarm on failure, but clear it once the process succeeds?
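One common pattern (a sketch only; the namespace, metric, and alarm names below are hypothetical) is to emit a data point on every run, 1 on failure and 0 on success, and alarm on the maximum. A successful re-run then immediately pushes a non-breaching data point that clears the alarm, while treating missing data as "ignore" keeps the alarm's state steady between daily runs:

import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.*;

CloudWatchClient cw = CloudWatchClient.create();

// At the end of every pipeline run, record the outcome: 1 = failed, 0 = succeeded.
boolean failed = false; // set from the actual pipeline outcome
cw.putMetricData(PutMetricDataRequest.builder()
        .namespace("DataPipeline")          // hypothetical namespace
        .metricData(MetricDatum.builder()
                .metricName("RunFailed")    // hypothetical metric
                .value(failed ? 1.0 : 0.0)
                .build())
        .build());

// Alarm: any failure breaches; "ignore" holds the current state while no new
// data arrives between runs, and a success's 0 transitions it back to OK.
cw.putMetricAlarm(PutMetricAlarmRequest.builder()
        .alarmName("pipeline-failure")      // hypothetical name
        .namespace("DataPipeline")
        .metricName("RunFailed")
        .statistic(Statistic.MAXIMUM)
        .period(300)
        .evaluationPeriods(1)
        .threshold(1.0)
        .comparisonOperator(ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD)
        .treatMissingData("ignore")
        .build());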

Trigger another Lambda a week after the first Lambda's execution

I am working on code where Lambda Function 1 (call it L1) executes on messages from an SQS queue. I want to execute another Lambda (call it L2) exactly a week after L1 completes, and I want to pass L1's output to L2.
Execution Environment: Java
For my application, we are expecting around 10k requests on L1 per day, and the same number of requests for L2.
If each runs for a week, we can have around 70k active executions at peak.
Things that I have tried:
CloudWatch Events with cron: I can schedule a cron with a specified time or date which will trigger L2, but I couldn't find a way to pass input with a scheduled CloudWatch event.
CloudWatch Events with new rules: At the end of the first Lambda I can create a new CloudWatch rule with a specified time and specified input, but that will create as many rules as there are executions (for my case, around 10k new CloudWatch rules every day). I am not sure if that is good practice or even supported.
Step Functions: There are two types of Step Functions in play today.
Standard: Supports waits of up to a year, but only supports 25k active executions at any time. Won't scale, since my application will already have 70k active executions at the end of the first week.
https://docs.aws.amazon.com/step-functions/latest/dg/limits.html
Express: Doesn't have a limit on the number of active executions, but supports a maximum execution duration of 5 minutes. It will time out after that.
https://docs.aws.amazon.com/step-functions/latest/dg/express-limits.html
It would be easy to create a new CloudWatch Events rule with the "week later" Lambda as a target as the last step in the first Lambda. You would set a rule with a cron that runs one time, one week out; the target then has an input field for the payload.
You didn't indicate your programming environment, but you can do something similar to the following (pseudocode, based on the Java SDK v2):
// AWS SDK v2: software.amazon.awssdk.services.cloudwatchevents
CloudWatchEventsClient client = CloudWatchEventsClient.create();
String lambdaArn = "the one week from today lambda arn";
// Cron expressions take six fields: cron(minute hour day-of-month month day-of-week year)
String ruleArn = client.putRule(PutRuleRequest.builder()
        .scheduleExpression("cron(17 20 23 7 ? *)")
        .name("myRule").build()).ruleArn();
Target target = Target.builder().id("weekLaterLambda").arn(lambdaArn)
        .input("{\"message\": \"blah\"}").build();
client.putTargets(PutTargetsRequest.builder()
        .rule("myRule").targets(target).build());
This will create a CloudWatch Events rule that fires one week from today with the input as shown. (To guarantee it fires only once, pin the cron's year field to a specific year; with * in the year field the expression recurs annually.)
Major Edit
With your new requirements (at least 1 week later, tens of thousands of events) I would not use the method I described above, as there are just too many things happening. Instead I would have a database of events that acts as a queue. Either a DynamoDB or an RDS database will suffice. At the end of each "primary" Lambda run, insert an event with the date and time of the next run. For example, today, July 18, I would insert July 25. The table would be something like (PostgreSQL syntax):
create table event_queue (
    run_time     timestamp not null,
    lambda_input varchar(8192)
);
create index on event_queue (run_time);
Where the lambda_input column has whatever data you want to pass to the "week later" Lambda. In PostgreSQL you would do something like:
insert into event_queue (run_time, lambda_input)
values ((current_timestamp + interval '1 week'), '{"value":"hello"}');
Every database has something similar to the date/time functions shown; if not, the code to do the same arithmetic in the application isn't terrible.
Now, in CloudWatch create a rule that runs once an hour (the resolution can be tuned). It will trigger a Lambda that "feeds" an SQS queue. The Lambda will query the database:
select * from event_queue where run_time < current_timestamp
and, for each row, put a message onto an SQS queue. The last thing it does is delete these "old" rows using the same WHERE clause.
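A minimal sketch of that feeder logic, assuming plain JDBC against the PostgreSQL table above and the SQS client from the AWS SDK v2 (the queue URL is a placeholder):

import java.sql.*;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

public class FeederLambda {
    private static final String QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/week-later"; // placeholder

    public void feed(Connection conn, SqsClient sqs) throws SQLException {
        // Capture one cutoff so rows inserted mid-run aren't deleted unseen.
        Timestamp cutoff = new Timestamp(System.currentTimeMillis());
        try (PreparedStatement select = conn.prepareStatement(
                "select lambda_input from event_queue where run_time < ?")) {
            select.setTimestamp(1, cutoff);
            try (ResultSet rs = select.executeQuery()) {
                while (rs.next()) {
                    // Forward each due event's payload to the "week later" queue.
                    sqs.sendMessage(SendMessageRequest.builder()
                            .queueUrl(QUEUE_URL)
                            .messageBody(rs.getString("lambda_input"))
                            .build());
                }
            }
        }
        // Delete the rows just handed off, using the same WHERE clause.
        try (PreparedStatement delete = conn.prepareStatement(
                "delete from event_queue where run_time < ?")) {
            delete.setTimestamp(1, cutoff);
            delete.executeUpdate();
        }
    }
}

Capturing a single cutoff timestamp, instead of re-evaluating current_timestamp in the delete, avoids silently dropping rows that land between the select and the delete.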
On the other side you have your "week later" Lambdas that are getting events from the SQS queue. These Lambdas are idle until a set of messages are put into the queue. At that time they fire up and empty the queue, doing whatever the "week later" Lambda is supposed to do.
By running the "feeder" Lambda hourly you basically capture everything that is 1 week plus up to 1 hour old. The less often you run it, the more work your "week later" Lambdas have to do per batch; conversely, running every minute adds load to the database but takes it off the week-later Lambdas.
This should scale well, assuming the "feeder" Lambda can keep up. 10k transactions / 24 hours is only about 416 transactions per hour, and reading the DB and creating the messages should be very quick. Even scaling that by 10, to 100k/day, is still only ~4,000 rows and messages per hour, which, again, should be very doable.
CloudWatch is more for cron jobs. To trigger something at a specific timestamp or after X amount of time, I would recommend using Step Functions instead.
You can achieve your use case with a State Machine containing a Wait state (you can tell it how long to wait based on your input) followed by your Lambda Task state. It will be similar to this example.
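As a rough sketch of that shape (the state machine name, role ARN, Lambda ARN, and the $.runAt input field are all hypothetical), the Amazon States Language definition below waits until a timestamp carried in the execution input and then invokes L2; it could be created with the SDK v2 Step Functions client:

import software.amazon.awssdk.services.sfn.SfnClient;
import software.amazon.awssdk.services.sfn.model.CreateStateMachineRequest;

// Wait until the timestamp in the input ($.runAt), then run L2.
String definition =
        "{\"StartAt\": \"WaitAWeek\"," +
        " \"States\": {" +
        "   \"WaitAWeek\": {\"Type\": \"Wait\", \"TimestampPath\": \"$.runAt\", \"Next\": \"InvokeL2\"}," +
        "   \"InvokeL2\": {\"Type\": \"Task\"," +
        "     \"Resource\": \"arn:aws:lambda:us-east-1:123456789012:function:L2\"," + // placeholder ARN
        "     \"End\": true}}}";

SfnClient sfn = SfnClient.create();
sfn.createStateMachine(CreateStateMachineRequest.builder()
        .name("week-later")                                    // hypothetical name
        .roleArn("arn:aws:iam::123456789012:role/sfn-role")    // placeholder role
        .definition(definition)
        .build());

L1 would then call StartExecution with an input whose runAt field is set one week out, alongside whatever output L2 needs.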

Next run time for a CloudWatch rate expression

I have scheduled a fixed rate of 2 days for an event rule. Where can I find the next run time?
It counts from the date and time you created the CloudWatch event. If your rate expression is rate(1 day) and you created the event at 22:00:00 UTC, then it will run at that time the next day. A rate expression of rate(2 days) will fire two days later at the same time. Similarly, rate(5 minutes) will fire every five minutes, beginning at the time the CloudWatch event was created.
To verify this, you can create a Lambda function that uploads an object to an S3 bucket, create a CloudWatch event for it, making note of the time you created it, and observe that when the event fires, the modification date of the object is that many days/minutes/hours from the time you created the event.
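For instance (the rule name is a placeholder), creating the rule below at 22:00:00 UTC today would, by the behavior described above, first fire at 22:00:00 UTC two days later, and every two days after that:

import software.amazon.awssdk.services.cloudwatchevents.CloudWatchEventsClient;
import software.amazon.awssdk.services.cloudwatchevents.model.PutRuleRequest;

CloudWatchEventsClient events = CloudWatchEventsClient.create();
// The two-day clock starts at rule creation time.
events.putRule(PutRuleRequest.builder()
        .name("every-two-days")             // hypothetical rule name
        .scheduleExpression("rate(2 days)")
        .build());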

How does CloudWatch deal with counter resets?

I have an alerts system which is currently powered by Prometheus and I need to port it to CloudWatch.
Prometheus is aware of counter resets, so I can, say, calculate rate() over the last 24h seamlessly, without handling the counter resets myself.
Is CloudWatch aware of this too?
The RATE function is available in CloudWatch Metric Math, defined as:
Returns the rate of change of the metric per second. This is
calculated as the difference between the latest data point value and
the previous data point value, divided by the time difference in
seconds between the two values.
So you would need to modify the way you emit the metric so that the counter is not reset. A possible workaround could be to increase the number of datapoints to alarm: for example, you can configure your alarms so they transition to ALARM only if 2 or more datapoints are less than or equal to (<=) 0; this way you'll avoid getting an alarm when a reset occurs.
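A sketch of such a metric-math alarm with the SDK v2 (the namespace, metric, alarm name, and the 2-of-2 datapoint settings are assumptions for illustration):

import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.*;

CloudWatchClient cw = CloudWatchClient.create();
MetricDataQuery raw = MetricDataQuery.builder()
        .id("m1")
        .metricStat(MetricStat.builder()
                .metric(Metric.builder()
                        .namespace("MyApp")             // hypothetical namespace
                        .metricName("RequestCount")     // hypothetical counter metric
                        .build())
                .period(60)
                .stat("Maximum")
                .build())
        .returnData(false)
        .build();
MetricDataQuery rate = MetricDataQuery.builder()
        .id("e1")
        .expression("RATE(m1)")     // per-second rate of change of m1
        .returnData(true)
        .build();
cw.putMetricAlarm(PutMetricAlarmRequest.builder()
        .alarmName("rate-drop")     // hypothetical alarm name
        .metrics(raw, rate)
        .comparisonOperator(ComparisonOperator.LESS_THAN_OR_EQUAL_TO_THRESHOLD)
        .threshold(0.0)
        .evaluationPeriods(2)
        .datapointsToAlarm(2)       // 2 datapoints <= 0 required, so a single reset won't alarm
        .build());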

CloudWatch Alarm triggers more often than expected

I have a problem with one of my CloudWatch alarms and I can't quite figure it out.
I have a metric for one of my log groups that inserts 1 data point whenever a "Fatal Alarm" is logged within the application.
A fatal alarm was logged last night at 01:44:51. My alarm changed to the ALARM state at 01:45 (as expected), but also went into the ALARM state at 02:35 and 03:05.
I have a bunch of screenshots with all the information that I believe is necessary for pointing me in the right direction:
Only one Fatal Log:
We can also see there's only one fatal log if we graph the metric:
Alarm state changes:
Alarm state changed to ALARM 1 (the expected one):
Alarm state changed to ALARM 2 (not expected):
Alarm state changed to ALARM 3 (not expected):
Alarm configuration:
Am I making some kind of obvious misconfiguration? I'm a bit confused as to what I am doing wrong.
Thanks in advance!
Frederik
There is no issue in your configuration as far as I can see. But if you want to see a graph corresponding to your incidents, you should change the graph's period to fifteen minutes or an hour; note that your alarm's period is set to 30 minutes.