launching AND terminating EMR cluster with boto3 on AWS Lambda - amazon-web-services

My case is the following. I want to launch a cluster during working hours and terminate it after 18:00 and weekends. The clusters will be used for a datascience project. Years ago we would use a boring crontab for this, but these days i prefer to do this with a lambda function.
In boto3 i can launch a cluster (thanks to Jose Quinteiro) and this post describes it very well How to launch and configure an EMR cluster using boto
How can i terminate a cluster in boto3 in the same lambda function as where i start it?

Using AWS CloudWatch event/rule and AWS Lambda function to check for Idle EMR clusters, you complete your goal. You achieve visibility on the AWS Console level and can easily enable and disable it.
Keeping in mind the need for this, I have developed a small framework to achieve that using the 2nd solution mentioned above. This framework is an AWS based solution using AWS CloudWatch and AWS Lambda using a Python script that is using Boto3 to terminate AWS EMR clusters that have been idle for a specified period of time.
You specify the maximum idle time threshold and AWS CloudWatch event/rule triggers an AWS Lambda function that queries all AWS EMR clusters in WAITING state and for each, compares the current time with AWS EMR cluster's ready time in case of no EMR steps added so far or compares the current time with AWS EMR cluster's last step's end time. If the threshold has been compromised, the AWS EMR will be terminated after removing termination protection if enabled. If not, it will skip that AWS EMR cluster.
AWS CloudWatch event/rule will decide how often AWS Lambda function should check for idle AWS EMR clusters.
You can disable the AWS CloudWatch event/rule at any time to disable this framework in a single click without deleting its AWS CloudFormation stack.
AWS Lambda function is using Python 3.7 as its runtime environment.
In your case, while creating the stack, you can specify your required Cron expression and maximum idle EMR cluster threshold in minutes to achieve this.
You can get the code and use it from GitHub here: https://github.com/abdullahkhawer/auto-terminate-idle-emr
Any contributions, improvements and suggestions to this solution will be highly appreciated. :)

You can terminate the cluster using boto3 by using
emr_client = boto3.client('emr')
emr_client.terminate_job_flows(JobFlowIds=[#replace it with cluster Id you want it to close ])

You could create a scheduled event in cloudwatch that triggers the lambda you are using.
Scheduled events use Cron expressions so you will be able to apply the same logic. Once your function is triggered you will need to determine that it is a shutdown trigger from the event input.

Related

How to use an AWS Restore job to a trigger Lambda function

I have a DocumentDB cluster backed up using AWS Backup. When I restore it, it just creates a cluster with no instances and the cluster uses the default security group of the VPC.
I could not find any solution to fix this as part of the restore job. So, I am using a lambda function that uses boto3 to update the security group and add instances to the cluster.
Now is it possible to trigger the Lambda function automatically when the restore job is completed?
When your Backup job finishes, you can capture an event using EventBridge and then trigger your Lambda off of that.
This blog post from AWS covers triggering a Lambda off the back of an AWS Backup job using EventBridge. It's not the exact same scenario since they're triggering the Lambda from the Backup AND Restore jobs, but you should be able to extract the steps you need from that.

AWS cloudwatch alarm using CLI for EMR

How to write a cloudwatch alarm for EMR using CLI command??
My requirement is to terminate cluster which are idle for more than 2hours. I need to do this using aws CLI command.
From Monitor Metrics with CloudWatch - Amazon EMR:
The IsIdle metric tracks whether a cluster is live, but not currently running tasks. You can set an alarm to fire when the cluster has been idle for a given period of time, such as thirty minutes.
Therefore, you can configure an alarm on this metric. However, the alarm itself is not able to terminate an Amazon EMR cluster. You would need an additional component, such as an AWS Lambda function, to actually terminate the cluster.
The components would be:
An Amazon CloudWatch IsIdle metric (automatically provided)
An Alarm on the metric that triggers when the cluster is idle for longer than the desired period
Configure the Alarm to send a message to an Amazon SNS topic
Create an AWS Lambda function and subscribe the function to the SNS topic
Code the Lambda function to terminate the Amazon EMR cluster
There is a more complex version of this auto-shutdown process documented at: Optimize Amazon EMR costs with idle checks and automatic resource termination using advanced Amazon CloudWatch metrics and AWS Lambda | AWS Big Data Blog

Idea and guidelines on end to end AWS solution

I want to build an end to end automated system which consists of the following steps:
Getting data from source to landing bucket AWS S3 using AWS Lambda
Running some transformation job using AWS Lambda and storing in processed bucket of AWS S3
Running Redshift copy command using AWS Lambda to push the transformed/processed data from AWS S3 to AWS Redshift
From the above points, I've completed pulling data, transforming data and running manual copy command from a Redshift using a SQL query tool.
Doubts:
I've heard AWS CloudWatch can be used to schedule/automate things but never worked on it. So, if I want to achieve the steps above in a streamlined fashion, how to go about it?
Should I use Lambda to trigger copy and insert statements? Or are there better AWS services to do the same?
Any other suggestion on other AWS Services and of the likes are most welcome.
Constraint: Want as many tasks as possible to be serverless (except for semantic layer, Redshift).
CloudWatch:
Your options here are either to use CloudWatch Alarms or Events.
With alarms, you can respond to any metric of your system (eg CPU utilization, Disk IOPS, count of Lambda invocations etc) when it crosses some threshold, and when this alarm is triggered, invoke a lambda function (or send SNS notification etc) to perform a task.
With events you can use either a cron expression or some AWS service event (eg EC2 instance state change, SNS notification etc) to then trigger another service (eg Lambda), so you could for example run some kind of clean-up operation via lambda on a regular schedule, or create a snapshot of an EBS volume when its instance is shut down.
Lambda itself is a very powerful tool, and should allow you to program a decent copy/insert function in a language you are familiar with. AWS has several GitHub repos with lots of examples too, see for example the serverless examples and many samples. There may be other services which could work for you in your specific case, but part of Lambda's power is its flexibility.

How to terminate AWS EMR Cluster automatically after some time

I currently have a task at hand to Terminate a long-running EMR cluster after a set period of time (based on some metric). Google Dataproc has this capability in something called "Cluster Scheduled Deletion" Listed here: Cluster Scheduled Deletion
Is this something that is possible on EMR natively? Maybe using Cloudwatch metrics? Or can I write a long-running jar which will sit on the EMR Master node and just poll yarn for some idle time metric and then shut down the cluster after a set period of time?
Edit: For more clarification. I would like some functionality wherein the cluster is terminated based on idle for some x amount of time. e.g. If the cluster has been up for a while but no jobs have been run for say 1 hour and the cluster is just sitting there doing nothing, then I'd like the ability to terminate the cluster.
The easiest method would be used to Amazon EMR Metrics and Dimensions for Amazon CloudWatch. There is an isIdle boolean that "indicates that a cluster is no longer performing work".
You could create a CloudWatch Alarm that says if it is True for more than x minutes, then trigger the alarm. This would send a message to Amazon SNS, which can trigger a Lambda function to shutdown the cluster.
Components:
Amazon CloudWatch Alarm
Amazon SNS queue
AWS Lambda function
Update: This apparently isn't suitable (see comments below).
An alternate method would be:
Use Amazon CloudWatch Events to schedule a Lambda function every x seconds
The Lambda function looks for any clusters with a particular tag that indicates how long to wait until shutdown (eg 40 minutes). If the tag is not present, the cluster remains untouched.
The Lambda function queries the cluster state (somehow -- probably via a Hadoop API call), then:
If the cluster is idle and there is no Idle Since tag, add an Idle Since tag with the current timestamp
If the cluster is idle and it been more than x minutes since the timestamp in the Idle Since tag, terminate the cluster.
If the cluster is not idle, remove the Idle Since tag (if present)
Keeping in mind the clarification that you have provided in your question, there could be 3 possible ways to do that.
1) Using AWS CloudWatch metric isIdle of an EMR cluster. This metric tracks whether a cluster is live, but not currently running tasks. You can set an alarm to fire when the cluster has been idle for a given period of time, such as thirty minutes.
Reference: https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html
2) [Recommended] Using AWS CloudWatch event/rule and AWS Lambda function to check for Idle EMR clusters. You can achieve visibility on the AWS Console level and can easily enable and disable it.
[Recommended] Solution using 2nd Approach
Keeping in mind the need for this, I have developed a small framework to achieve that using the 2nd solution mentioned above. This framework is an AWS based solution using AWS CloudWatch and AWS Lambda using a Python script that is using Boto3 to terminate AWS EMR clusters that have been idle for a specified period of time.
You specify the maximum idle time threshold and AWS CloudWatch event/rule triggers an AWS Lambda function that queries all AWS EMR clusters in WAITING state and for each, compares the current time with AWS EMR cluster's ready time in case of no EMR steps added so far or compares the current time with AWS EMR cluster's last step's end time. If the threshold has been compromised, the AWS EMR will be terminated after removing termination protection if enabled. If not, it will skip that AWS EMR cluster.
AWS CloudWatch event/rule will decide how often AWS Lambda function should check for idle AWS EMR clusters.
You can disable the AWS CloudWatch event/rule at any time to disable this framework in a single click without deleting its AWS CloudFormation stack.
AWS Lambda function is using Python 3.7 as its runtime environment.
You can get the code and use it from GitHub here: https://github.com/abdullahkhawer/auto-terminate-idle-emr
Note: Any contributions, improvements, and suggestions to this solution that I developed will be highly appreciated.
3) Some other custom solution based on a Shell that runs against a CRON job on an EMR cluster's master node but you will lose its visibility on the AWS Console level and you may require SSH access as well.
I had to do a similar implementation and just considering Cluster Elapsed time was not solving our problem.
so we came up with a approach to hit the Hadoop API, you can find them here
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API
So here is what we did,
Ask the user who brings up a cluster to add a Tag like "AutoShutDown":"True:BufferMinutes", here "AutoShutDown" is the key and "True:BufferMinutes" is the value of the Tag
Here BufferMinutes is the time in minutes (30, 60 etc.)
create a Lambda to hit the hadoop api of all those clusters configured with step 1 (if the user does not add the Tag then the cluster is untouched) and fetch the end time of the last job that was completed (only if all jobs are either completed / terminated), if any job is still running then do nothing and exit.
now
datetime_difference = (current_time - lastFinished)
if(datetime_difference > requested_time)
{
terminate_cluster
}
Create a cloud watch trigger and add the lambda created as target to it, schedule the trigger to run as required.
Note: Lambda is written in python, so boto3 is used and client will be "emr" same like what abdullahkhawer mentioned in his solution above.
This implementation gives flexibility to the user to choose and reduces a great deal of burden on dev-ops.

Automation of on-demand AWS EMR cluster - Using Python (boto3) over AWS CLI

We are in the process of automating the launch of on demand EMR clusters. This will be triggered upon the arrival of certain files in AWS S3. In this regard, we are evaluating two options -
1. Shell script that will invoke a AWS CLI to launch the desired EMR cluster
2. Python script that will invoke methods for EMR start, stop using the boto3
Is there any preference of using one option over the other?
The former appears easier, as we can take the CLI from the manually created EMRs from the AWS console and package it into a shell script. While the later option has intricacies and doesn't have such a starting point and the methods would have to be written from scratch.
Appreciate your inputs in this regard.
While both can achieve what you want, I would suggest to go with Lambda (Python).
Create an event trigger on the S3 location where data is expected - this will invoke your lambda (python code) and lambda can in-turn launch your EMR.
s3-> lambda -> EMR
Another option could be to trigger a data pipeline from lambda which will create the EMR for you.
s3 -> lambda -> pipeline -> EMR
Advantages of using pipeline vs lambda to create EMR
GUI based: You can pick and choose the components needed like resources, activites, schedules etc.
Minimal Python: In the lambda you will just configure the pipeline to be triggered, you don't need to implement error handling, retries, success or failure emails etc. All of this is inbuilt in the pipelines
Flexible: Since pipeline components are modular and configurable, you can change any configuration quickly. Code changes often takes more time.
You can read more about it here - https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/what-is-datapipeline.html