I'm creating a bunch of EBS snapshots from an AWS Lambda function. I need to capture events when these snapshots complete so I can create an EC2 instance based on them.
I could use a snapshot waiter, but it polls, and snapshot creation can sometimes take a long time. I don't want the Lambda to keep running for that long, and the maximum run time for Lambda seems to be five minutes. I looked at CloudWatch and AWS Config to see if I could capture snapshot events, but had no luck.
There is now a new event in AWS CloudWatch Events for when snapshots complete:
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-cloud-watch-events.html
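As a rough sketch of how that event could be consumed from Python: the rule name, the function names, and the idea of passing in a boto3 "events" client are my own assumptions, while the pattern fields follow the EBS snapshot notification format described on the linked page.

```python
import json

# Event pattern matching successful EBS snapshot creation events.
SNAPSHOT_COMPLETED_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EBS Snapshot Notification"],
    "detail": {"event": ["createSnapshot"], "result": ["succeeded"]},
}


def create_snapshot_rule(events_client, rule_name="ebs-snapshot-completed"):
    """Create or update the rule; events_client is assumed to be boto3.client("events")."""
    return events_client.put_rule(
        Name=rule_name,  # hypothetical rule name
        EventPattern=json.dumps(SNAPSHOT_COMPLETED_PATTERN),
        State="ENABLED",
    )


def completed_snapshot_id(event):
    """Return the snapshot ID from a completion event, or None for anything else."""
    detail = event.get("detail", {})
    if detail.get("event") != "createSnapshot" or detail.get("result") != "succeeded":
        return None
    # The event carries the snapshot as an ARN ending in ".../snap-xxxx".
    return detail["snapshot_id"].rsplit("/", 1)[-1]


def lambda_handler(event, context):
    """Lambda entry point, assuming the function is the target of the rule."""
    snapshot_id = completed_snapshot_id(event)
    if snapshot_id is not None:
        print(f"Snapshot {snapshot_id} completed; launch the instance here.")
    return snapshot_id
```

The rule still needs the Lambda function added as a target (and permission for Events to invoke it), which is left out here.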
You are correct -- there is no notification event that signifies completion of an EBS Snapshot. Instead, you would need to check the status until the status changes to completed.
You are also correct that AWS Lambda functions can run for a maximum of 5 minutes and having a Lambda function waiting on an external process is not a good architecture.
Instead, you could break up the architecture:
Have your existing process trigger the EBS Snapshot(s) and then push a message into an SQS queue
Schedule a Lambda function (eg every 5 minutes) to check the SQS queue. If a message exists:
Retrieve details about the instance and snapshot(s) from the message
Check the status of the snapshot(s)
If the status is completed, perform the next step in the process
The downside is that the scheduled Lambda function will trigger even when there are no messages in the queue. In that case the Lambda function will exit very quickly (cost: 100ms).
The alternative is to run a cron script on an Amazon EC2 instance (or on any computer connected to the Internet). A t2.nano instance is about 15.6c per day, which might be more expensive than a scheduled Lambda function. If you already have an instance being used, then there would be no additional cost.
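The scheduled check in the SQS-based design could look roughly like this. It is only a sketch: the message shape (instance_id, snapshot_ids), the function names, and the next_step callback are assumptions, and the clients are expected to be boto3 "sqs" and "ec2" clients passed in by the caller.

```python
import json


def snapshots_completed(ec2_client, snapshot_ids):
    """True when every snapshot in snapshot_ids has reached the completed state."""
    response = ec2_client.describe_snapshots(SnapshotIds=snapshot_ids)
    return all(s["State"] == "completed" for s in response["Snapshots"])


def check_queue(sqs_client, ec2_client, queue_url, next_step):
    """Process at most one queued job; return True if the next step ran."""
    response = sqs_client.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
    messages = response.get("Messages", [])
    if not messages:
        return False  # nothing queued, so exit quickly
    message = messages[0]
    # Assumed message shape: {"instance_id": "...", "snapshot_ids": ["snap-..."]}
    job = json.loads(message["Body"])
    if not snapshots_completed(ec2_client, job["snapshot_ids"]):
        # Leave the message alone; it becomes visible again after the
        # queue's visibility timeout and is rechecked on a later run.
        return False
    next_step(job)
    sqs_client.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return True
```

The message is deleted only after the next step has run, so a failed run is retried on the next schedule.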
Related
I have some ECS tasks running in AWS Fargate which in very rare cases may "die" internally, but will still show as "RUNNING" and not fail and trigger the task to restart.
What I would like to do, if possible, is check for the absence of logs, e.g. if logs haven't been written in 30 minutes, trigger a Lambda to kill the ECS task, which will cause it to start back up.
The health check functionality isn't sufficient.
If this isn't possible, are there any other approaches I could consider?
You could use a metric with anomaly detection, but processing logs into a metric may cost money, and the alarm may cost money too. I would rather run a Lambda every 30 minutes that checks whether logs are still being written and kills the ECS task as needed. You can run a Lambda on an interval with CloudWatch Events (EventBridge).
Logs are probably sent from your ECS tasks to a CloudWatch Logs log group. If the log group has a static name, you can use the SDK to describe the streams inside the group. This API call tells you the timestamp of the last event in each stream.
Inside the Lambda Node.js runtime, aws-sdk v2 is already present, so you can require it without installing it. Here is the doc for v2:
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/CloudWatchLogs.html#describeLogStreams-property
Set orderBy: "LastEventTime" together with descending: true so that the most recent stream comes first and, to save network time, lower the limit from the default of 50 to limit: 1. The result will contain lastEventTimestamp.
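For comparison, here is roughly the same check in Python with boto3 rather than the Node.js SDK. Treat it as a sketch: the function names, the 30 minute default, and the idea of stopping one known task ARN are assumptions. The clients are expected to be boto3 "logs" and "ecs" clients.

```python
import time


def minutes_since_last_log(logs_client, log_group_name, now_ms=None):
    """Minutes since the newest event in the group, or None if nothing is known."""
    response = logs_client.describe_log_streams(
        logGroupName=log_group_name,
        orderBy="LastEventTime",
        descending=True,  # newest stream first
        limit=1,
    )
    streams = response.get("logStreams", [])
    if not streams or "lastEventTimestamp" not in streams[0]:
        return None
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    # lastEventTimestamp is in milliseconds since the epoch.
    return (now_ms - streams[0]["lastEventTimestamp"]) / 60000


def restart_if_silent(logs_client, ecs_client, log_group_name, cluster, task_arn,
                      max_silent_minutes=30):
    """Stop the task when no logs were seen recently; return True if it was stopped."""
    silent = minutes_since_last_log(logs_client, log_group_name)
    if silent is not None and silent > max_silent_minutes:
        ecs_client.stop_task(cluster=cluster, task=task_arn, reason="No logs written recently")
        return True
    return False
```

One caveat worth checking before relying on this: the CloudWatch Logs documentation describes lastEventTimestamp as eventually consistent, so it can lag behind ingestion, and a short threshold may produce false positives.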
anomaly detection:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html
alarms:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
Check the pricing for these. There is a free tier, so maybe it won't cost you anything, yet it's easy to build up real $ spend with CloudWatch: https://aws.amazon.com/cloudwatch/pricing/
To run lambda on interval:
In my architecture, when I receive a new file in an S3 bucket, a Lambda function triggers an ECS task.
The problem occurs when I receive multiple files at the same time: the Lambda triggers multiple instances of the same ECS task, which all act on the same shared resources.
I want to ensure that only one instance of a specific ECS task is running. How can I do that?
Is there a specific setting that can ensure it?
I tried to query the ECS cluster before running a new instance of the ECS task, but (using the AWS Python SDK) I didn't receive any information while the task was in PROVISIONING status; the SDK only returns data when the task is in PENDING or RUNNING.
Thank you
I don't think you can control that, because each S3 event will trigger a new task. Checking whether the task is already running is harder, and you might miss executions if you receive a lot of files.
You should approach this differently. If you want only one task doing the processing, then forget about triggering the ECS task from the S3 event. It might work better if you use a queue: your S3 event should add the file information (via Lambda, maybe?) to an SQS queue.
From there you can have an ECS service that long-polls the SQS queue and processes one message at a time.
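A minimal sketch of such a worker, assuming a boto3 "sqs" client and a hypothetical handle_file callback that does the actual processing (both names are mine, not from the question):

```python
def process_one_message(sqs_client, queue_url, handle_file):
    """Long-poll the queue and process at most one message; return True if one was handled."""
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,  # one message at a time
        WaitTimeSeconds=20,     # long polling; 20 seconds is the maximum
    )
    for message in response.get("Messages", []):
        handle_file(message["Body"])
        # Delete only after successful processing so that failures are retried.
        sqs_client.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        return True
    return False


def run_worker(sqs_client, queue_url, handle_file):
    """Loop forever; intended to run inside the single task of the ECS service."""
    while True:
        process_one_message(sqs_client, queue_url, handle_file)
```

With the service's desired count set to 1, only one consumer touches the shared resources at any time, and files that arrive together simply wait in the queue.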
So I have this scenario where an Amazon EC2 instance in an Auto Scaling group will be terminated. My problem is that I don’t want it terminated until it has finished whatever it’s doing.
If I hook up a lambda, the lambda would check a metric, if this metric is > 0 then it needs to wait 60 seconds.
I have this done already, the problem is it may take more than the Max timeout for lambdas of 15 minutes, to finish the jobs it’s processing.
If I read correctly, the lifecycle notification is only sent once, so this lambda won’t work for me.
Is there any other way of doing this?
Here is how I would try to approach this problem (this needs a POC; the answer is theoretical):
Create an Auto Scaling Group
Put a lifecycle hook on this ASG as described here, sending notifications to Amazon SNS
Create a launch script for instances which will do the following
subscribe to SNS on instance launch and start SNS listener script
SNS listener will wait for instance termination message, do whatever necessary to wait until instance is ready to terminate, including sending heartbeats if termination needs more than 1h and completing lifecycle hook (described here). This should also handle unsubscription from SNS.
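The waiting part of the listener could be sketched like this once the termination message has arrived. The function name, the is_busy callback, and the 60 second poll are assumptions; asg_client is expected to be a boto3 "autoscaling" client, and like the rest of this answer it is untested against a real group.

```python
import time


def drain_then_complete(asg_client, hook_name, group_name, instance_id, is_busy,
                        poll_seconds=60, sleep=time.sleep):
    """Hold the instance in its wait state until is_busy() is False, then release it."""
    while is_busy():
        # Extend the lifecycle hook timeout while work is still in progress.
        asg_client.record_lifecycle_action_heartbeat(
            LifecycleHookName=hook_name,
            AutoScalingGroupName=group_name,
            InstanceId=instance_id,
        )
        sleep(poll_seconds)
    # Tell the Auto Scaling group that termination may proceed.
    asg_client.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=group_name,
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE",
    )
```

Because this runs on the instance itself rather than in Lambda, the 15 minute Lambda timeout does not apply.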
I currently have a task at hand to Terminate a long-running EMR cluster after a set period of time (based on some metric). Google Dataproc has this capability in something called "Cluster Scheduled Deletion" Listed here: Cluster Scheduled Deletion
Is this something that is possible on EMR natively? Maybe using Cloudwatch metrics? Or can I write a long-running jar which will sit on the EMR Master node and just poll yarn for some idle time metric and then shut down the cluster after a set period of time?
Edit: For more clarification. I would like some functionality wherein the cluster is terminated based on idle for some x amount of time. e.g. If the cluster has been up for a while but no jobs have been run for say 1 hour and the cluster is just sitting there doing nothing, then I'd like the ability to terminate the cluster.
The easiest method would be to use Amazon EMR Metrics and Dimensions for Amazon CloudWatch. There is an IsIdle boolean that "indicates that a cluster is no longer performing work".
You could create a CloudWatch Alarm that triggers if the metric is True for more than x minutes. This would send a message to Amazon SNS, which can trigger a Lambda function to shut down the cluster.
Components:
Amazon CloudWatch Alarm
Amazon SNS topic
AWS Lambda function
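A sketch of the alarm piece, assuming a boto3 "cloudwatch" client; the alarm name, function names, and the 60 minute default are made up for illustration:

```python
def idle_alarm_params(cluster_id, sns_topic_arn, idle_minutes=60):
    """Build put_metric_alarm arguments for an alarm on the EMR IsIdle metric."""
    return {
        "AlarmName": f"emr-idle-{cluster_id}",  # hypothetical naming scheme
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,  # EMR publishes its metrics every five minutes
        "EvaluationPeriods": max(1, idle_minutes // 5),
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_idle_alarm(cloudwatch_client, cluster_id, sns_topic_arn, idle_minutes=60):
    """Create the alarm; the SNS topic is assumed to have the Lambda subscribed."""
    cloudwatch_client.put_metric_alarm(
        **idle_alarm_params(cluster_id, sns_topic_arn, idle_minutes)
    )
```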
Update: This apparently isn't suitable (see comments below).
An alternate method would be:
Use Amazon CloudWatch Events to schedule a Lambda function every x seconds
The Lambda function looks for any clusters with a particular tag that indicates how long to wait until shutdown (eg 40 minutes). If the tag is not present, the cluster remains untouched.
The Lambda function queries the cluster state (somehow -- probably via a Hadoop API call), then:
If the cluster is idle and there is no Idle Since tag, add an Idle Since tag with the current timestamp
If the cluster is idle and it has been more than x minutes since the timestamp in the Idle Since tag, terminate the cluster.
If the cluster is not idle, remove the Idle Since tag (if present)
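The tag bookkeeping above can be sketched as a small decision function plus the matching EMR calls. How to detect idleness is left open, as in the answer; the function names and action strings are assumptions, and emr_client is expected to be a boto3 "emr" client.

```python
from datetime import datetime, timedelta
from typing import Optional

IDLE_TAG = "Idle Since"


def decide_action(is_idle: bool, idle_since: Optional[datetime],
                  now: datetime, max_idle: timedelta) -> str:
    """Return one of "add_tag", "remove_tag", "terminate" or "none"."""
    if not is_idle:
        return "remove_tag" if idle_since is not None else "none"
    if idle_since is None:
        return "add_tag"
    return "terminate" if now - idle_since > max_idle else "none"


def apply_action(emr_client, cluster_id: str, action: str, now: datetime) -> None:
    """Carry out the decision against the cluster."""
    if action == "add_tag":
        emr_client.add_tags(
            ResourceId=cluster_id,
            Tags=[{"Key": IDLE_TAG, "Value": now.isoformat()}],
        )
    elif action == "remove_tag":
        emr_client.remove_tags(ResourceId=cluster_id, TagKeys=[IDLE_TAG])
    elif action == "terminate":
        emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
```

Keeping the decision separate from the API calls makes the three cases easy to check without a real cluster.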
Keeping in mind the clarification that you have provided in your question, there could be 3 possible ways to do that.
1) Using the AWS CloudWatch metric IsIdle of an EMR cluster. This metric tracks whether a cluster is live, but not currently running tasks. You can set an alarm to fire when the cluster has been idle for a given period of time, such as thirty minutes.
Reference: https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html
2) [Recommended] Using AWS CloudWatch event/rule and AWS Lambda function to check for Idle EMR clusters. You can achieve visibility on the AWS Console level and can easily enable and disable it.
[Recommended] Solution using 2nd Approach
Keeping in mind the need for this, I have developed a small framework to achieve that using the 2nd solution mentioned above. This framework is an AWS based solution using AWS CloudWatch and AWS Lambda using a Python script that is using Boto3 to terminate AWS EMR clusters that have been idle for a specified period of time.
You specify the maximum idle time threshold, and the AWS CloudWatch event/rule triggers an AWS Lambda function that queries all AWS EMR clusters in the WAITING state. For each cluster, it compares the current time with the cluster's ready time if no EMR steps have been added so far, or with the end time of the cluster's last step otherwise. If the threshold has been exceeded, the cluster is terminated after termination protection is removed (if enabled). If not, that cluster is skipped.
AWS CloudWatch event/rule will decide how often AWS Lambda function should check for idle AWS EMR clusters.
You can disable the AWS CloudWatch event/rule at any time to disable this framework in a single click without deleting its AWS CloudFormation stack.
AWS Lambda function is using Python 3.7 as its runtime environment.
You can get the code and use it from GitHub here: https://github.com/abdullahkhawer/auto-terminate-idle-emr
Note: Any contributions, improvements, and suggestions to this solution that I developed will be highly appreciated.
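To give an idea of the logic described for the 2nd approach, here is an illustrative sketch. It is not the code from the repository: the function names are made up, emr_client is assumed to be a boto3 "emr" client, and pagination of list_clusters is ignored for brevity.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def idle_since(emr_client, cluster_id: str) -> datetime:
    """End time of the last finished step, or the cluster ready time if no step has ended."""
    steps = emr_client.list_steps(ClusterId=cluster_id)["Steps"]
    end_times = [
        step["Status"]["Timeline"]["EndDateTime"]
        for step in steps
        if "EndDateTime" in step["Status"]["Timeline"]
    ]
    if end_times:
        return max(end_times)
    cluster = emr_client.describe_cluster(ClusterId=cluster_id)["Cluster"]
    return cluster["Status"]["Timeline"]["ReadyDateTime"]


def terminate_idle_clusters(emr_client, max_idle: timedelta,
                            now: Optional[datetime] = None) -> List[str]:
    """Terminate WAITING clusters idle for longer than max_idle; return their IDs."""
    now = now or datetime.now(timezone.utc)
    terminated = []
    for summary in emr_client.list_clusters(ClusterStates=["WAITING"])["Clusters"]:
        cluster_id = summary["Id"]
        if now - idle_since(emr_client, cluster_id) > max_idle:
            # Termination protection must be off before the cluster can be terminated.
            emr_client.set_termination_protection(
                JobFlowIds=[cluster_id], TerminationProtected=False
            )
            emr_client.terminate_job_flows(JobFlowIds=[cluster_id])
            terminated.append(cluster_id)
    return terminated
```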
3) Some other custom solution based on a shell script that runs as a cron job on an EMR cluster's master node, but you will lose its visibility on the AWS Console level and you may require SSH access as well.
I had to do a similar implementation, and just considering the cluster's elapsed time was not solving our problem.
So we came up with an approach that calls the Hadoop API, which you can find here:
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API
So here is what we did,
Ask the user who brings up a cluster to add a Tag like "AutoShutDown":"True:BufferMinutes", here "AutoShutDown" is the key and "True:BufferMinutes" is the value of the Tag
Here BufferMinutes is the time in minutes (30, 60 etc.)
Create a Lambda that calls the Hadoop API of every cluster tagged as in step 1 (if the user does not add the tag, the cluster is left untouched) and fetches the end time of the last completed job, but only if all jobs are either completed or terminated. If any job is still running, do nothing and exit.
now
datetime_difference = current_time - last_finished
if datetime_difference > requested_time:
    terminate_cluster()
Create a CloudWatch trigger and add the Lambda as its target; schedule the trigger to run as required.
Note: The Lambda is written in Python, so boto3 is used and the client will be "emr", the same as in the solution abdullahkhawer mentioned above.
This implementation gives flexibility to the user to choose and reduces a great deal of burden on dev-ops.
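The decision logic could be sketched like this. It assumes the Lambda has already fetched the JSON from the ResourceManager's applications endpoint (/ws/v1/cluster/apps, which reports state and finishedTime per application); the function names are mine and the tag format is the one described above.

```python
from datetime import datetime, timedelta, timezone

# YARN application states that mean work is still queued or running.
ACTIVE_STATES = {"NEW", "NEW_SAVING", "SUBMITTED", "ACCEPTED", "RUNNING"}


def parse_auto_shutdown_tag(value):
    """Parse "True:30" into a timedelta; return None when shutdown is not enabled."""
    enabled, _, minutes = value.partition(":")
    if enabled.strip().lower() != "true" or not minutes.strip().isdigit():
        return None
    return timedelta(minutes=int(minutes))


def last_finished(apps_response):
    """Latest finish time across all apps, or None if any app is active or none ran."""
    apps = (apps_response.get("apps") or {}).get("app", [])
    if not apps or any(app["state"] in ACTIVE_STATES for app in apps):
        return None
    newest_ms = max(app["finishedTime"] for app in apps)  # milliseconds since epoch
    return datetime.fromtimestamp(newest_ms / 1000, tz=timezone.utc)


def should_terminate(tag_value, apps_response, now):
    """True when the cluster opted in and has been idle longer than its buffer."""
    buffer = parse_auto_shutdown_tag(tag_value)
    finished = last_finished(apps_response)
    if buffer is None or finished is None:
        return False
    return now - finished > buffer
```

A cluster that has never run a job returns False here, so the elapsed-time case would need separate handling if it matters.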
I have a business case where, when an EC2 instance runs out of space, we need to spawn a new EBS volume, attach it to the EC2 instance, and format it.
I have created a cron job which keeps sending disk usage to CloudWatch, and I am trying to create an alarm on this custom metric.
Now I am not able to find any information on how to spawn an EBS volume when this alarm triggers.
So I would like to know if it is possible to spawn an EBS volume when a CloudWatch alarm triggers. If yes, please give some steps or point to the document where I can find this information.
As of now, all I have found is that we can either spawn new instances or send some emails whenever an alarm triggers.
You can fire a notification to an SNS topic when the CloudWatch alarm fires, and have an SQS queue as a subscriber to that topic. Then, an EC2 instance consuming that SQS queue can perform the desired change using the AWS CLI or SDKs.
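A sketch of what the consumer could do with each SQS message, assuming the queue receives the standard SNS envelope (raw message delivery off) and a boto3 "ec2" client. The function names, the gp3 volume type, the 100 GiB size, and the /dev/sdf device are all placeholders.

```python
import json


def alarm_from_sqs_body(body):
    """Unwrap the CloudWatch alarm JSON from an SNS notification delivered to SQS."""
    envelope = json.loads(body)
    return json.loads(envelope["Message"])


def add_volume(ec2_client, instance_id, availability_zone, size_gib, device="/dev/sdf"):
    """Create a volume in the instance's zone, wait for it, and attach it."""
    volume = ec2_client.create_volume(
        AvailabilityZone=availability_zone, Size=size_gib, VolumeType="gp3"
    )
    volume_id = volume["VolumeId"]
    # The volume must be in the available state before it can be attached.
    ec2_client.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    ec2_client.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    return volume_id


def handle_alarm_message(ec2_client, body, instance_id, availability_zone, size_gib=100):
    """Add a volume only when the alarm has moved into the ALARM state."""
    alarm = alarm_from_sqs_body(body)
    if alarm.get("NewStateValue") != "ALARM":
        return None
    return add_volume(ec2_client, instance_id, availability_zone, size_gib)
```

Formatting and mounting still have to happen on the instance itself (for example with mkfs and mount) once the device shows up.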