I have a Lambda function that does some work. I want to create a CloudWatch alarm on its Duration metric, i.e. how long the Lambda takes to run.
I tried the following values for the alarm, but I am running into an issue with it, probably because of cold starts. These are the values I am setting:
Statistic : Average
ComparisonOperator : "GreaterThanThreshold"
Threshold: 1000
EvaluationPeriods: 5
Period: 60
Unit: Milliseconds
The issue is that the alarm keeps going into the ALARM state, probably because of cold starts, since the function is not invoked that often.
What are the best values to set for a Lambda? How are other people setting alarms on Lambda?
Also, how long does a Lambda have to go without being invoked before its execution environment is shut down and a cold start can occur?
Use Blue Matador. The thresholds are dynamic, account for daily variation and cold starts, and use machine learning to detect real anomalies. It does the same thing for all the services that Lambda interacts with (DynamoDB, SQS, API Gateway, RDS, Kinesis, S3, etc.).
Disclaimer: I'm the founder of Blue Matador.
If you're looking to do it yourself with CloudWatch, I would recommend timing out after a certain period and returning an error. Then you can use the Errors metric to tell how many invocations failed over a given time period. It's not a perfect solution, but it can correctly ignore cold starts. We wrote a blog post, How to Monitor AWS Lambda with CloudWatch, that covers errors, throttles, and more metrics to watch out for.
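A minimal Node.js sketch of that timeout idea (the 1000 ms deadline and the doWork function are illustrative placeholders, not from the question):

const DEADLINE_MS = 1000;

async function doWork(event) {
  // ... the function's real work goes here (placeholder)
  return { ok: true };
}

function withDeadline(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

exports.handler = async (event) => {
  // A rejection here marks the invocation as failed, incrementing the
  // Errors metric, which the alarm can watch instead of Duration.
  return withDeadline(doWork(event), DEADLINE_MS);
};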
Related
I have some ECS tasks running in AWS Fargate which, in very rare cases, may "die" internally but will still show as RUNNING, so they never fail and trigger the task to restart.
What I would like to do, if possible, is check for the absence of logs: e.g. if logs haven't been written in 30 minutes, trigger a Lambda to kill the ECS task, which will cause it to start back up.
The health check functionality isn't sufficient.
If this isn't possible, are there any other approaches I could consider?
You can use a metric filter with anomaly detection, but turning logs into a metric costs money, and the alarm may cost too. I would rather run a Lambda every 30 minutes that checks whether logs are arriving and then kills the ECS task as needed. You can run a Lambda on an interval with Amazon EventBridge (CloudWatch Events).
Your ECS task probably sends its logs to a CloudWatch Logs log group. If the log group has a static name, you can use the SDK to describe the streams inside it; that API call will tell you the timestamp of the last data in each stream.
Inside the Lambda Node.js runtime, aws-sdk v2 is already present, so you can require it without installing. Here is the doc for v2:
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/CloudWatchLogs.html#describeLogStreams-property
Set orderBy: "LastEventTime" (with descending: true so the most recent stream comes first) and, to save network time, lower the limit from the default 50 to limit: 1; the result will contain lastEventTimestamp.
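Putting that together, a minimal sketch of the checker Lambda (the log group name, cluster name, task ARN, and the 30-minute threshold are placeholder assumptions):

const AWS = require('aws-sdk');
const logs = new AWS.CloudWatchLogs();
const ecs = new AWS.ECS();

exports.handler = async () => {
  const res = await logs.describeLogStreams({
    logGroupName: '/ecs/my-task',  // hypothetical log group name
    orderBy: 'LastEventTime',      // sort streams by last event
    descending: true,              // most recent stream first
    limit: 1,                      // save network time: 1 instead of 50
  }).promise();

  const stream = res.logStreams[0];
  const THIRTY_MIN = 30 * 60 * 1000;
  if (!stream || !stream.lastEventTimestamp ||
      Date.now() - stream.lastEventTimestamp > THIRTY_MIN) {
    // No logs in 30 minutes: stop the task so ECS starts it back up.
    await ecs.stopTask({
      cluster: 'my-cluster',       // hypothetical cluster name
      task: process.env.TASK_ARN,  // hypothetical: task ARN from env
      reason: 'No logs written for 30 minutes',
    }).promise();
  }
};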
anomaly detection:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html
alarms:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
Check the pricing for these. There is a free tier, so maybe it won't cost you anything, yet it's easy to build up real $ spend with CloudWatch: https://aws.amazon.com/cloudwatch/pricing/
To run the Lambda on an interval, create an EventBridge (CloudWatch Events) schedule rule:
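A hedged sketch with aws-sdk v2 (the rule name and function ARN are placeholders):

const AWS = require('aws-sdk');
const events = new AWS.CloudWatchEvents();

(async () => {
  // Create (or update) a rule that fires every 30 minutes.
  await events.putRule({
    Name: 'ecs-log-checker-every-30min',   // placeholder rule name
    ScheduleExpression: 'rate(30 minutes)',
  }).promise();

  // Point the rule at the checker Lambda.
  await events.putTargets({
    Rule: 'ecs-log-checker-every-30min',
    Targets: [{
      Id: 'checker-lambda',
      Arn: 'arn:aws:lambda:us-east-1:123456789012:function:log-checker',  // placeholder ARN
    }],
  }).promise();
  // Remember to grant events.amazonaws.com invoke permission on the
  // function (lambda.addPermission), or the rule will fire into nothing.
})();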
So I'm trying to set up composite alarms on AWS, and so far I have most of it working. At the moment, I have a composite alarm made up of 3 alarms. If any 2 of these 3 alarms trigger, then the composite alarm also triggers. This part works fine.
However, I am having trouble with part of my use case. I'd also like to make it so that if one of these alarms within the composite alarm stays in alarm for over a certain period of time, then an alert is also sent out.
Here's an example of the situation:
2 out of the 3 alarms turn on in any time period: Alert should be sent
1 out of the 3 alarms turn on for under a certain time period: Alert should not be sent
1 out of the 3 alarms turn on for over a certain time period: Alert should be sent
I've tried looking into the settings available on the alarms themselves, and there doesn't seem to be an option for what I'm trying to do.
I'm wondering if this would require a lambda function? Is it possible for a lambda function to keep track of how long an alarm has been in alarm?
As discussed in the comments above, I am providing a possible solution to your problem. The only blocker is that you can't have different time frames for the alarms; both should use the same period.
So you will have, for example: Alarm 1 (CPU) goes off if CPU is over 60% for 15 minutes, and Alarm 2 (EFS connections) goes off if there are more than 10 connections for 15 minutes.
Now the composite alarm will go off when both statements are true, and it will also go off when only Alarm 1 goes off.
This is how you would build that alarm; see the sketch below.
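A hedged sketch with aws-sdk v2 (the alarm names and SNS topic ARN are placeholders; the AlarmRule string uses CloudWatch's own expression syntax):

const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

(async () => {
  await cloudwatch.putCompositeAlarm({
    AlarmName: 'cpu-and-efs-composite',  // placeholder composite name
    // In ALARM when both child alarms fire, or when Alarm 1 fires alone.
    AlarmRule: '(ALARM("Alarm1-cpu") AND ALARM("Alarm2-efs")) OR ALARM("Alarm1-cpu")',
    AlarmActions: ['arn:aws:sns:us-east-1:123456789012:alerts'],  // placeholder topic
  }).promise();
})();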
As for testing, it depends on what type of alarms you are making. For example, CPU and RAM increment methods are widely available on Stack Overflow.
Also, with the AWS CLI you can change the state of an alarm. The forced state usually only lasts a very short time, maybe 10 seconds.
aws cloudwatch set-alarm-state --alarm-name "myalarm" --state-value ALARM --state-reason "testing purposes"
You need to find the method that suits your needs best.
I have an AWS Lambda function that uses an SQS trigger to pull messages, processes them with an AWS Comprehend endpoint, and puts the output in S3. The Comprehend endpoint has a rate limit that goes up and down throughout the day based on something I can control. The fastest way to process my data, which also optimizes what I pay for keeping the Comprehend endpoint up, is to set concurrency high enough that I get throttling errors back from the API. The caveat is that I am then paying for more Lambda invocations; the flip side is that to optimize what I pay for Lambda, I want zero throttling errors.
Is it possible to set up autoscaling for the concurrency limit of the Lambda, such that it increases when there are no throttling errors but decreases when there are too many?
Very interesting use case.
Let me start by pointing out something I found out the hard way, in an almost 4-hour call with AWS Tech Support after being puzzled for a couple of days.
With SQS acting as a trigger for AWS Lambda, the concurrency cannot go beyond 1K, even if the Lambda's concurrency limit is set higher.
There is now a detailed post on this over at the AWS Knowledge Center.
With that out of the way, and assuming you stay under the 1K limit at any given time and therefore only need one SQS queue, here is what I feel can be explored:
Either use an existing CloudWatch metric (via Comprehend) or publish a custom metric that indicates the load you can handle at any given point in time. You can then use this to set an appropriate concurrency limit for the Lambda function. This ensures that even if the SQS queue is flooded with messages, Lambda picks them up at the rate at which they can actually be processed.
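A hedged sketch of that idea with aws-sdk v2 (the namespace, metric name, function name, and the rate-to-concurrency mapping are all illustrative assumptions):

const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();
const lambda = new AWS.Lambda();

exports.handler = async () => {
  // Read the latest value of a custom metric that reflects the load
  // Comprehend can currently handle (published elsewhere).
  const stats = await cloudwatch.getMetricStatistics({
    Namespace: 'Custom/Comprehend',  // hypothetical namespace
    MetricName: 'ProcessableRate',   // hypothetical metric name
    StartTime: new Date(Date.now() - 5 * 60 * 1000),
    EndTime: new Date(),
    Period: 300,
    Statistics: ['Average'],
  }).promise();

  const latest = stats.Datapoints.sort((a, b) => b.Timestamp - a.Timestamp)[0];
  if (!latest) return;

  // Map the rate to a reserved-concurrency value (mapping is illustrative),
  // clamped to the 1K SQS-trigger ceiling mentioned above.
  const limit = Math.max(1, Math.min(1000, Math.floor(latest.Average)));
  await lambda.putFunctionConcurrency({
    FunctionName: 'comprehend-worker',  // placeholder function name
    ReservedConcurrentExecutions: limit,
  }).promise();
};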
Please note: this comes out of my own philosophy of being proactive rather than reactive. I would not wait for something to fail, e.g. invocation errors in this case, before adjusting concurrency. System failures should be rare and should actually raise an alarm (if not panic!), rather than be something normal that occurs a couple of times a day!
To build on that, if possible I would suggest approaching this the other way around, i.e. scale the Comprehend processing limit and the AWS Lambda concurrency based on the messages in the SQS queue (the backlog), or a combination of the backlog and the time of day, etc. That way, if every part of your pipeline is a function of the amount of backlog in the queue, you can rest assured that you are not spending more than you need at any given point in time.
More importantly, you always have capacity in place should the need arise or something out of the ordinary happen.
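A short sketch of that backlog-driven variant (the queue URL and the messages-per-worker ratio are assumptions):

const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

exports.handler = async () => {
  const attrs = await sqs.getQueueAttributes({
    QueueUrl: process.env.QUEUE_URL,  // placeholder queue URL
    AttributeNames: ['ApproximateNumberOfMessages'],
  }).promise();

  const backlog = Number(attrs.Attributes.ApproximateNumberOfMessages);
  // Illustrative mapping: one concurrent worker per 100 queued messages;
  // apply the result with lambda.putFunctionConcurrency as sketched above.
  return Math.max(1, Math.min(1000, Math.ceil(backlog / 100)));
};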
I have a few AWS Lambda functions, but this troubleshooting concerns one of them. The function is triggered by a message queue, reads from DynamoDB, processes, and writes back to DynamoDB. It is called up to 10 times per second, and I have set up Lambda provisioned concurrency. The average duration is 60 ms, which I am very happy with. But every day there are around 10 invocations whose duration is more than 1 second, up to the 3-second timeout.
I put logging in my Lambda; during the duration spikes, DynamoDB reads/writes (GetItem/PutItem) took more than 1 second. The DynamoDB table is set to on-demand and is very simple: two columns, ID (auto number) and a JSON string (about 1 KB). I have tried Redis, but, weirdly enough, still had spikes. The Lambda is not in a VPC. The DynamoDB connection is set to a 500 ms HTTP timeout and a max retry count of 2.
Code to read DynamoDB:
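(The original snippet was not included; here is a minimal sketch matching the settings described above, with a hypothetical table name:)

const AWS = require('aws-sdk');

const dynamodb = new AWS.DynamoDB.DocumentClient({
  httpOptions: { timeout: 500 },  // 500 ms HTTP timeout, as described
  maxRetries: 2,                  // max retries set to 2, as described
});

async function readItem(id) {
  const res = await dynamodb.get({
    TableName: 'MyTable',  // hypothetical table name
    Key: { ID: id },
  }).promise();
  return res.Item;         // { ID, json } per the two-column schema
}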
Log for Duration:
When using provisioned concurrency, the Lambda service keeps a set number of the underlying execution environments "warm" to minimize start-up time. Since you mention intermittently seeing higher execution durations, here are the debugging steps you can take:
Check the "Concurrent Executions" metric for the Lambda function against the "Duration" metric: if the number of instances of the function executing at a particular time is higher than the configured provisioned concurrency, that would imply that a few of those instances had cold starts, causing the higher duration.
Enable X-Ray tracing for the Lambda function and add X-Ray instrumentation to your code: this gives a complete picture of which network call takes too much time and also shows the cold-start "init" duration (if any).
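A minimal sketch of that instrumentation using the aws-xray-sdk-core package (with active tracing enabled on the function, each DynamoDB call then shows up as its own timed subsegment):

const AWSXRay = require('aws-xray-sdk-core');
// Wrap the SDK so every AWS call (including DynamoDB) is traced.
const AWS = AWSXRay.captureAWS(require('aws-sdk'));

const dynamodb = new AWS.DynamoDB.DocumentClient();
// Each get/put now appears in the trace with its own latency, separating
// network time and cold-start "init" time from your own code.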
I have an instance in AWS whose CPU from time to time crosses the 90% threshold.
I created an alert for this; however, I received only one notification, during the first 5 minutes, even though the CPU stayed at 100% for 2 hours.
How do I set up the alarm so that I keep getting notifications the whole time?
CloudWatch does not send notifications continuously while the threshold is breached. CloudWatch sends a notification only when the alarm state changes.
Alarms invoke actions for sustained state changes only. CloudWatch alarms do not invoke actions simply because they are in a particular state, the state must have changed and been maintained for a specified number of periods.
Ref: AWS CloudWatch Documentation
One possible solution I can think of is to create multiple CloudWatch alarms with multiple thresholds.
As the answer above already says, the alarm is not triggered again. One thing you can do is change the alarm condition to a very large value and then back to the original value; the state change will then occur again.
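A hedged sketch of that trick with aws-sdk v2 (putMetricAlarm replaces the whole alarm definition, so every field must be re-specified; the alarm name, instance ID, and topic ARN are placeholders matching the question's 90% CPU alarm):

const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch();

async function setThreshold(threshold) {
  await cloudwatch.putMetricAlarm({
    AlarmName: 'high-cpu-alarm',  // placeholder alarm name
    Namespace: 'AWS/EC2',
    MetricName: 'CPUUtilization',
    Dimensions: [{ Name: 'InstanceId', Value: 'i-0123456789abcdef0' }],  // placeholder
    Statistic: 'Average',
    Period: 300,
    EvaluationPeriods: 1,
    Threshold: threshold,
    ComparisonOperator: 'GreaterThanThreshold',
    AlarmActions: ['arn:aws:sns:us-east-1:123456789012:alerts'],  // placeholder topic
  }).promise();
}

(async () => {
  await setThreshold(999);  // effectively unreachable: the alarm returns to OK
  await setThreshold(90);   // restore: still-breaching CPU re-enters ALARM and re-notifies
})();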