I have a process that runs once every 24 hours (data pipeline). I want to monitor it for failures, but am having trouble defining an alarm that works adequately.
Since the process runs only once every 24 hours, there can be 1 failure every 24 hours at most.
If I define a short period (e.g. 5 minutes), then the alarm will flip back to OK status after 5 minutes, as there are no more errors.
If I define a period of 24 hours, then the alarm will be stuck in ERROR status until the period passes, even if I re-run the process manually and it succeeds, because "one error within 24 hour period" is still true.
How do I get an alarm on failure, but clear it once the process succeeds?
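One pattern that fits this case is to have the pipeline publish a custom metric at the end of every run — 1 on failure, 0 on success — and alarm on that metric with a short period. A successful manual re-run then publishes a 0, which flips the alarm back to OK without waiting for a 24-hour window to expire. Below is a minimal boto3 sketch; the namespace `DataPipeline` and metric name `FailureCount` are placeholders, not anything AWS defines:

```python
def build_put_metric(success, namespace="DataPipeline", metric="FailureCount"):
    # 1 on failure, 0 on success -- publishing the 0 is what clears the alarm
    return {
        "Namespace": namespace,
        "MetricData": [
            {"MetricName": metric, "Value": 0.0 if success else 1.0, "Unit": "Count"}
        ],
    }

def report_run(success):
    import boto3  # AWS call kept inside the function so the builder stays testable
    boto3.client("cloudwatch").put_metric_data(**build_put_metric(success))
```

The matching alarm would use `Statistic=Maximum`, a short `Period` (e.g. 300 s), `Threshold=1`, and `TreatMissingData="ignore"`, so the alarm holds its last known state between daily runs instead of flapping back to OK when no data arrives.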
I am running AWS Glue jobs using PySpark. They have a Timeout set (visible in the screenshot) of 1440 minutes, i.e. 24 hours. Nevertheless, the jobs keep running past those 24 hours.
When this particular job had been running for over 5 days, I stopped it manually (by clicking the stop icon in the "Run status" column of the GUI shown in the screenshot). However, since then (over 2 days now) it still hasn't stopped - the "Run status" is Stopping, not Stopped.
Additionally, after about 4 hours of running, new logs (column "Logs") in CloudWatch for this Job Run stopped appearing (my PySpark script contains print() statements that regularly and often log extra data). Also, the last error log in CloudWatch (column "Error logs") was written 24 seconds after the date of the newest log in "Logs".
This behaviour repeats across multiple jobs.
My questions are:
What could be the reasons for Job Runs not obeying the set Timeout value? How can that be fixed?
Why is the newest log from about 4 hours after the Job Run started, when logs should appear regularly throughout the (desired) 24-hour duration of the Job Run?
Why don't the Job Runs stop when I try to stop them manually? How can they be stopped?
Thank you in advance for your advice and hints.
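For runs stuck in Stopping, it is sometimes worth issuing the stop through the API rather than the console. A boto3 sketch, where the job name and run IDs are placeholders; if a run stays in Stopping even after this, it usually has to be escalated to AWS support:

```python
def build_stop_request(job_name, run_ids):
    # BatchStopJobRun takes a job name plus a list of run IDs to stop
    return {"JobName": job_name, "JobRunIds": list(run_ids)}

def force_stop(job_name, run_ids):
    import boto3  # AWS call kept inside the function so the builder stays testable
    glue = boto3.client("glue")
    resp = glue.batch_stop_job_run(**build_stop_request(job_name, run_ids))
    # Runs that could not be stopped come back in the Errors list
    return resp.get("Errors", [])
```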
I have an alarm set up in AWS CloudWatch which generates a data point every hour. When its value is greater than or equal to 1, it goes into ALARM state. The settings are as follows:
On 2nd Nov, it went into ALARM state and then back to OK state in 3 hours. I'm trying to understand why it took 3 hours to get back to OK instead of 1 hour, given that the metric is emitted every hour.
Here are the logs showing that the alarm transitioned from ALARM to OK state in 3 hours.
The following graph shows the data point value every hour.
This is probably because alarms are evaluated over a longer window than your 1-hour period: the evaluation range. In your case, the evaluation range can span several of your 1-hour periods, so it takes longer for the alarm state to change.
There is also a thread about this behaviour, with extra info, on the AWS forum:
Unexplainable delay between Alarm data breach and Alarm state change
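The delay can be checked against the alarm's own configuration: CloudWatch evaluates an alarm over the last `Period × EvaluationPeriods` seconds, so the state change can lag by up to that whole window. A small sketch, assuming the alarm name is known:

```python
def evaluation_range_seconds(period_s, evaluation_periods):
    # CloudWatch looks at the last Period * EvaluationPeriods seconds of data,
    # so a state change can lag by up to this whole window
    return period_s * evaluation_periods

def inspect_alarm(alarm_name):
    import boto3  # AWS call kept inside the function so the math stays testable
    alarm = boto3.client("cloudwatch").describe_alarms(
        AlarmNames=[alarm_name])["MetricAlarms"][0]
    return evaluation_range_seconds(alarm["Period"], alarm["EvaluationPeriods"])
```

With `Period=3600` and `EvaluationPeriods=3`, the range is 10800 seconds — exactly the 3-hour lag observed here.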
I have 2 microservices. Each microservice's Lambda function timeout is set to 15 minutes, but I get a timeout after 5 minutes when I monitor the logs on logz.io. Any idea why this is the case? I increased the limit from 5 minutes to 15 minutes, but it seems to have no effect. Please help!
There is a chance that your function actually completes in under 5 minutes. Try adding a sleep of 10 minutes, e.g. sleep(10*60*1000), in the code and check the total execution time.
The default timeout is 3 seconds, and you can set the timeout to any value between 1 and 900 seconds (15 minutes); the older 300-second maximum no longer applies.
It seems like your function is not picking up the timeout you set.
https://lumigo.io/blog/aws-lambda-timeout-best-practices/
Make the change in the serverless.yaml file by adding "timeout" as one of the parameters and setting its value. With this, every new deployment will keep this value as the timeout.
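A minimal Serverless Framework fragment showing where the key goes; the function name and runtime here are made up, and 900 seconds is Lambda's current per-function maximum:

```yaml
# serverless.yaml -- hypothetical function name
provider:
  name: aws
  runtime: python3.9
  timeout: 30        # default for all functions in this service, in seconds

functions:
  processOrders:
    handler: handler.process
    timeout: 900     # per-function override; 900 s (15 min) is the Lambda maximum
```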
I cannot get alarms to work reliably for an AWS EC2 instance. I have a g2.xlarge instance running and want it to stop when it is not in use, i.e. when average CPU usage falls below 2%.
When I try 1 or 2 periods of 1 hour below 2%, it usually works, but then when I start the instance up again it immediately stops itself, since it is still in an alarm condition. I have tried 12 periods of 5 minutes, which allows it to start OK, but now it doesn't stop at all despite being in an alarm condition for several hours.
I have tried various options and can't pin down what makes it work and what doesn't. It feels as if sometimes things work and sometimes they don't. Is it buggy, or am I missing something?
Here is a screenshot of my setup which has failed to trigger a shutdown...
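One way to debug this is to pin the alarm configuration down explicitly in code, so it can be compared against whatever the console created. A boto3 sketch using 12 × 5-minute periods, with `TreatMissingData="notBreaching"` so a freshly started instance does not inherit a stale ALARM state; the instance ID and the region in the stop-action ARN are placeholders:

```python
def build_idle_alarm(instance_id, threshold_pct=2.0):
    # 12 x 5-minute periods of average CPU below the threshold triggers the stop;
    # "notBreaching" keeps missing data (instance off) from holding the alarm
    return {
        "AlarmName": f"stop-when-idle-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,
        "EvaluationPeriods": 12,
        "Threshold": threshold_pct,
        "ComparisonOperator": "LessThanThreshold",
        "TreatMissingData": "notBreaching",
        # Built-in EC2 stop action; region is an assumption
        "AlarmActions": ["arn:aws:automate:us-east-1:ec2:stop"],
    }

def create_idle_alarm(instance_id):
    import boto3  # AWS call kept inside the function so the builder stays testable
    boto3.client("cloudwatch").put_metric_alarm(**build_idle_alarm(instance_id))
```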
I was evaluating SNS for a realtime application I am building and need a really fast turnaround time of < 2 seconds in delivering the message.
Since I am located in the APAC region, I have an SNS topic in Singapore with a Lambda subscriber in us-east-1.
Given this setup, I ran code to measure the latency of invoking the Lambda, doing zero processing and just logging the time. One might argue that Lambda invocation latency is also included in this measurement, which is true: I need the Lambda to be invoked, executed, and replied to within < 2 seconds.
I sent 23914 messages, with an average of 653.520 ms for transport + Lambda invocation,
with peaks around 600995 ms (~10 minutes), which is terrible latency for a pub/sub technology.
About 20117 messages were sent and received by the Lambda in < 653 ms, which means 3797 messages, or ~15%, took more than the average time.
2958 messages, or 12.36%, took over 1 second to be executed.
379 messages, or 1.59%, took over 2 seconds to be invoked and executed (which means 1.6% of my messages cannot be considered realtime and have to be ignored).
82 messages over 10 seconds
64 over 20 seconds
It goes on until ~45 seconds, after which the delay jumps to 10 minutes. I have 3 packets with a 10-minute delay.
What bothers me is that about 2% (if you include the processing time as well) of my messages cannot be processed in realtime, at a tiny scale of ~24K messages.
The scale I am trying to plan for requires processing about 216 billion messages per month. At that scale, I am worried that I will not be able to process 4.3 billion messages in realtime.
Given this experiment, I am not sure how well SNS would scale. Would the number of less-than-realtime messages (read: > 2 second delay) increase, or would it decrease?
Now, there might be a tendency to question my internet connection's reliability; I re-did this experiment on EC2 and got very similar results.
In fact, the delays occurred at around the same times.
Specific Questions
What are the SLAs for SNS performance?
Indirectly: how do these SLAs translate to those of the AWS Lambda service?
Any ideas where these delays might be happening?
Most likely what happened here was throttling of the Lambda function. The default limit for concurrent Lambda executions was 100. If you sent 20K messages, you likely exceeded that limit, despite the short runtime of the Lambda. When your Lambda function is throttled while executing an SNS request, the request goes onto a retry queue and is re-executed up to 3 times; those retries can be spread over a long period of time (up to an hour).
You can see the number of throttles in the CloudWatch metrics for the function (unfortunately, you ran your test before the 6-month CloudWatch metric retention was released).
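The throttling theory can be checked against the function's `Throttles` metric in the `AWS/Lambda` namespace. A boto3 sketch; the function name is a placeholder:

```python
from datetime import datetime, timedelta, timezone

def build_throttle_query(function_name, hours=24):
    # Sum of throttle events per 5-minute bucket over the last `hours` hours
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/Lambda",
        "MetricName": "Throttles",
        "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 300,
        "Statistics": ["Sum"],
    }

def throttle_count(function_name):
    import boto3  # AWS call kept inside the function so the builder stays testable
    points = boto3.client("cloudwatch").get_metric_statistics(
        **build_throttle_query(function_name))["Datapoints"]
    return sum(p["Sum"] for p in points)
```

A non-zero total during the test window would confirm that at least part of the tail latency came from throttled-and-retried deliveries rather than SNS transport itself.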
Last I checked, there was no SLA for SNS. SNS is designed to be horizontally scalable and to (almost) never drop a message, not to deliver it quickly.
Update: since March 2019 there is an SLA for SNS:
https://aws.amazon.com/messaging/sla/
Is there any reason why you can't invoke the Lambda from the publisher via the API and pass the data within the invocation event?
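That direct-invocation alternative would look roughly like the following sketch (the function name and payload are made up). `InvocationType="RequestResponse"` waits for the result, while `"Event"` would make it asynchronous fire-and-forget:

```python
import json

def build_invoke(function_name, payload):
    # "RequestResponse" waits for the result; "Event" is fire-and-forget
    return {
        "FunctionName": function_name,
        "InvocationType": "RequestResponse",
        "Payload": json.dumps(payload).encode(),
    }

def invoke_directly(function_name, payload):
    import boto3  # AWS call kept inside the function so the builder stays testable
    resp = boto3.client("lambda").invoke(**build_invoke(function_name, payload))
    return json.load(resp["Payload"])
```

This removes SNS's delivery-and-retry machinery from the latency path, at the cost of the publisher now needing to know about the subscriber and handle its own retries.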