How to terminate AWS EMR Cluster automatically after some time - amazon-web-services

I currently have a task at hand to Terminate a long-running EMR cluster after a set period of time (based on some metric). Google Dataproc has this capability in something called "Cluster Scheduled Deletion" Listed here: Cluster Scheduled Deletion
Is this something that is possible on EMR natively? Maybe using Cloudwatch metrics? Or can I write a long-running jar which will sit on the EMR Master node and just poll yarn for some idle time metric and then shut down the cluster after a set period of time?
Edit: For more clarification. I would like some functionality wherein the cluster is terminated based on idle for some x amount of time. e.g. If the cluster has been up for a while but no jobs have been run for say 1 hour and the cluster is just sitting there doing nothing, then I'd like the ability to terminate the cluster.

The easiest method would be used to Amazon EMR Metrics and Dimensions for Amazon CloudWatch. There is an isIdle boolean that "indicates that a cluster is no longer performing work".
You could create a CloudWatch Alarm that says if it is True for more than x minutes, then trigger the alarm. This would send a message to Amazon SNS, which can trigger a Lambda function to shutdown the cluster.
Components:
Amazon CloudWatch Alarm
Amazon SNS queue
AWS Lambda function
Update: This apparently isn't suitable (see comments below).
An alternate method would be:
Use Amazon CloudWatch Events to schedule a Lambda function every x seconds
The Lambda function looks for any clusters with a particular tag that indicates how long to wait until shutdown (eg 40 minutes). If the tag is not present, the cluster remains untouched.
The Lambda function queries the cluster state (somehow -- probably via a Hadoop API call), then:
If the cluster is idle and there is no Idle Since tag, add an Idle Since tag with the current timestamp
If the cluster is idle and it been more than x minutes since the timestamp in the Idle Since tag, terminate the cluster.
If the cluster is not idle, remove the Idle Since tag (if present)

Keeping in mind the clarification that you have provided in your question, there could be 3 possible ways to do that.
1) Using AWS CloudWatch metric isIdle of an EMR cluster. This metric tracks whether a cluster is live, but not currently running tasks. You can set an alarm to fire when the cluster has been idle for a given period of time, such as thirty minutes.
Reference: https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html
2) [Recommended] Using AWS CloudWatch event/rule and AWS Lambda function to check for Idle EMR clusters. You can achieve visibility on the AWS Console level and can easily enable and disable it.
[Recommended] Solution using 2nd Approach
Keeping in mind the need for this, I have developed a small framework to achieve that using the 2nd solution mentioned above. This framework is an AWS based solution using AWS CloudWatch and AWS Lambda using a Python script that is using Boto3 to terminate AWS EMR clusters that have been idle for a specified period of time.
You specify the maximum idle time threshold and AWS CloudWatch event/rule triggers an AWS Lambda function that queries all AWS EMR clusters in WAITING state and for each, compares the current time with AWS EMR cluster's ready time in case of no EMR steps added so far or compares the current time with AWS EMR cluster's last step's end time. If the threshold has been compromised, the AWS EMR will be terminated after removing termination protection if enabled. If not, it will skip that AWS EMR cluster.
AWS CloudWatch event/rule will decide how often AWS Lambda function should check for idle AWS EMR clusters.
You can disable the AWS CloudWatch event/rule at any time to disable this framework in a single click without deleting its AWS CloudFormation stack.
AWS Lambda function is using Python 3.7 as its runtime environment.
You can get the code and use it from GitHub here: https://github.com/abdullahkhawer/auto-terminate-idle-emr
Note: Any contributions, improvements, and suggestions to this solution that I developed will be highly appreciated.
3) Some other custom solution based on a Shell that runs against a CRON job on an EMR cluster's master node but you will lose its visibility on the AWS Console level and you may require SSH access as well.

I had to do a similar implementation and just considering Cluster Elapsed time was not solving our problem.
so we came up with a approach to hit the Hadoop API, you can find them here
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API
So here is what we did,
Ask the user who brings up a cluster to add a Tag like "AutoShutDown":"True:BufferMinutes", here "AutoShutDown" is the key and "True:BufferMinutes" is the value of the Tag
Here BufferMinutes is the time in minutes (30, 60 etc.)
create a Lambda to hit the hadoop api of all those clusters configured with step 1 (if the user does not add the Tag then the cluster is untouched) and fetch the end time of the last job that was completed (only if all jobs are either completed / terminated), if any job is still running then do nothing and exit.
now
datetime_difference = (current_time - lastFinished)
if(datetime_difference > requested_time)
{
terminate_cluster
}
Create a cloud watch trigger and add the lambda created as target to it, schedule the trigger to run as required.
Note: Lambda is written in python, so boto3 is used and client will be "emr" same like what abdullahkhawer mentioned in his solution above.
This implementation gives flexibility to the user to choose and reduces a great deal of burden on dev-ops.

Related

Is there any way to trigger an AWS Lambda if Cloudwatch logs haven't been updated in X amount of time?

I have some ECS tasks running in AWS Fargate which in very rare cases may "die" internally, but will still show as "RUNNING" and not fail and trigger the task to restart.
What I would like to do, if possible is check for the absence of logs, e.g. if logs haven't been written in 30 minutes, trigger a lambda to kill the ECS task which will cause it to start back up.
The health check functionality isn't sufficient.
If this isn't possible, are there any other approaches I could consider?
you can have metric and anomaly detection but it may cost for metric to process logs + alarm may cost too. Would rather do lambda run every 30min which would check if logs are there and then would kill ECS as needed. you can run lambda on interval with cloudwatch events bridge.
Logs are probably sent to cloudwatch logs group from your ECS, if you have static name of the logs group, you can use SDK to describe streams inside the group. This api call will tell you timestamp of the last data in stream.
inside lambda nodejs context aws-sdk v2 is already present, so you can require w/o install. here is doc for v2:
https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/CloudWatchLogs.html#describeLogStreams-property
pick to orderBy: "LastEventTime" and to save networking time, set limit from default 50 to 1 limit: 1 and in result you will have lastEventTimestamp
anomaly detection:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html
alarms:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
check pricing for these, there is free tier, so maybe it won't cost you anything, yet it's easy to build up real $ spend with cloudwatch. https://aws.amazon.com/cloudwatch/pricing/
To run lambda on interval:

Auto Terminate EMR Cluster using Step Functions

I have a use case where I would submitting dynamic number of jobs to the cluster, hence opting to submit jobs via SDK from a lambda and not add submit jobs as a task in step function. The EMR cluster would be used once a week and hence want to opt for onDemand variant.
Looks like "auto-terminate" parameter is not supported when creating cluster from Step Functions. As per the doc, The field Instances.KeepJobFlowAliveWhenNoSteps is mandatory, and must have the Boolean value TRUE.
Is there an alternative way to terminate cluster after all jobs are completed?
You have few options to terminate the cluster, but it depends on your scenerio.
Since you are using Lambda, you can check for the state of cluster periodically and if its is WAITING, you can terminate the cluster with the ID. You can also make a CloudWatch event with AWS Lambda function to check if EMR cluster is Idle. you can find a good answer for this specific approach here and the code implementation by the same user here
A very naive and stupid thing but can work is to deliberately submit a failing step as the final step and use 'TERMINATE_CLUSTER' on option key ActionOnFailure while submitting with add_job_flow_steps()
Update on your question:
would there be potential race condition where in EMR cluster could
terminate after its started and before jobs got submitted?
The waiting time between the cluster staring and jobs submission/first job running isnt same, you can have a logic around deciding maximum idle time threshold for cloudwatch

Idea and guidelines on end to end AWS solution

I want to build an end to end automated system which consists of the following steps:
Getting data from source to landing bucket AWS S3 using AWS Lambda
Running some transformation job using AWS Lambda and storing in processed bucket of AWS S3
Running Redshift copy command using AWS Lambda to push the transformed/processed data from AWS S3 to AWS Redshift
From the above points, I've completed pulling data, transforming data and running manual copy command from a Redshift using a SQL query tool.
Doubts:
I've heard AWS CloudWatch can be used to schedule/automate things but never worked on it. So, if I want to achieve the steps above in a streamlined fashion, how to go about it?
Should I use Lambda to trigger copy and insert statements? Or are there better AWS services to do the same?
Any other suggestion on other AWS Services and of the likes are most welcome.
Constraint: Want as many tasks as possible to be serverless (except for semantic layer, Redshift).
CloudWatch:
Your options here are either to use CloudWatch Alarms or Events.
With alarms, you can respond to any metric of your system (eg CPU utilization, Disk IOPS, count of Lambda invocations etc) when it crosses some threshold, and when this alarm is triggered, invoke a lambda function (or send SNS notification etc) to perform a task.
With events you can use either a cron expression or some AWS service event (eg EC2 instance state change, SNS notification etc) to then trigger another service (eg Lambda), so you could for example run some kind of clean-up operation via lambda on a regular schedule, or create a snapshot of an EBS volume when its instance is shut down.
Lambda itself is a very powerful tool, and should allow you to program a decent copy/insert function in a language you are familiar with. AWS has several GitHub repos with lots of examples too, see for example the serverless examples and many samples. There may be other services which could work for you in your specific case, but part of Lambda's power is its flexibility.

launching AND terminating EMR cluster with boto3 on AWS Lambda

My case is the following. I want to launch a cluster during working hours and terminate it after 18:00 and weekends. The clusters will be used for a datascience project. Years ago we would use a boring crontab for this, but these days i prefer to do this with a lambda function.
In boto3 i can launch a cluster (thanks to Jose Quinteiro) and this post describes it very well How to launch and configure an EMR cluster using boto
How can i terminate a cluster in boto3 in the same lambda function as where i start it?
Using AWS CloudWatch event/rule and AWS Lambda function to check for Idle EMR clusters, you complete your goal. You achieve visibility on the AWS Console level and can easily enable and disable it.
Keeping in mind the need for this, I have developed a small framework to achieve that using the 2nd solution mentioned above. This framework is an AWS based solution using AWS CloudWatch and AWS Lambda using a Python script that is using Boto3 to terminate AWS EMR clusters that have been idle for a specified period of time.
You specify the maximum idle time threshold and AWS CloudWatch event/rule triggers an AWS Lambda function that queries all AWS EMR clusters in WAITING state and for each, compares the current time with AWS EMR cluster's ready time in case of no EMR steps added so far or compares the current time with AWS EMR cluster's last step's end time. If the threshold has been compromised, the AWS EMR will be terminated after removing termination protection if enabled. If not, it will skip that AWS EMR cluster.
AWS CloudWatch event/rule will decide how often AWS Lambda function should check for idle AWS EMR clusters.
You can disable the AWS CloudWatch event/rule at any time to disable this framework in a single click without deleting its AWS CloudFormation stack.
AWS Lambda function is using Python 3.7 as its runtime environment.
In your case, while creating the stack, you can specify your required Cron expression and maximum idle EMR cluster threshold in minutes to achieve this.
You can get the code and use it from GitHub here: https://github.com/abdullahkhawer/auto-terminate-idle-emr
Any contributions, improvements and suggestions to this solution will be highly appreciated. :)
You can terminate the cluster using boto3 by using
emr_client = boto3.client('emr')
emr_client.terminate_job_flows(JobFlowIds=[#replace it with cluster Id you want it to close ])
You could create a scheduled event in cloudwatch that triggers the lambda you are using.
Scheduled events use Cron expressions so you will be able to apply the same logic. Once your function is triggered you will need to determine that it is a shutdown trigger from the event input.

run scheduled task in AWS without cron

Currently I have a single server in amazon where I put all my cronjobs. I want to eliminate this single point of failure, and expose all my tasks as web services. I'd like to expose the services behind a VPC ELB to a few servers that will run the tasks when called.
Is there some service that Amazon (AWS) offers that can run a reoccurring job (really call a webservice) at scheduled intervals? I'd really like to be able to keep the cron functionality in terms of time/day specification, but farm out the HA of the driver (thing that calls endpoints at the right time) to AWS.
I like how SQS offers web endpoint(s), but from what I can tell you cant schedule them. SWF doesn't seem to be a good fit either.
AWS announced support for scheduled functions in Lambda at its 2015 re:Invent conference. With this feature users can execute Lambda functions on a scheduled basis using a cron-like syntax. The Lambda docs show an example of using Python to perform scheduled events.
Currently, the minimum resolution that a scheduled lambda can run at is 1 minute (the same as cron, but not as fine grained as systemd timers).
The Lambder project helps to simplify the use of scheduled functions on Lambda.
λ Gordon's cron example has perhaps the simplest interface for deploying scheduled lambda functions.
Original answer, saved for posterity.
As Eric Hammond and others have stated, there is no native AWS service for scheduled tasks. There are only workarounds and half solutions as mentioned in other answers.
To recap the current options:
The single-instance autoscale group that starts and stops on a schedule, as described by Eric Hammond.
Using a Simple Workflow Service timer, which is not at all intuitive. This case study mentions that JPL used SWF to build a distributed cron, but there are no implementation details. There is also a reference to a code example buried in the SWF code samples.
Run it yourself using something like cronlock.
Use something like the Unreliable Town Clock (UTC) to run Lambda functions on a schedule. Remember that Lambda cannot currently access resources within a VPC
Hopefully a better solution will come along soon.
Introducing Events in AWS Cloudwatch
You can schedule by minute, hourly, days or using CRON expression using console and without Lambda or any programming.
I just scheduled my ASP.net WEB API(HTTP Post) using SNS HTTP endpoint to execute every minute and it's working perfectly.
Is there some service that Amazon (AWS) offers that can run a reoccurring job at scheduled intervals?
This is one of a few single points of failure that people (including me) keep mentioning when designing architectures with AWS. Until Amazon solves it with a service, here's a hack I've published which is actively used by some companies.
AWS Auto Scaling can run and terminate instances using a recurring schedule specified in the cron format.
http://docs.amazonwebservices.com/AutoScaling/latest/APIReference/API_PutScheduledUpdateGroupAction.html
You can have the instance automatically run a process on startup.
If you don't know how long the job will last, you can set things up so that your job terminates the instance when it has completed.
Here's an article I wrote that walks through exact commands needed to set this up:
Running EC2 Instances on a Recurring Schedule with Auto Scaling
http://alestic.com/2011/11/ec2-schedule-instance
Starting a whole instance just to kick off a set of jobs seems a bit like overkill, but if it's a t1.micro, then it only costs a couple pennies.
That t1.micro doesn't have to do the actual work either. Your instance could inject messages into SQS or through SNS so that the other redundant servers pick up the tasks.
This a hosted third party site that can regularly call scheduled scripts on your domain.
This will not work if you need your script to run in the shell, and not as Apache.
Sounds like this might be useful to you:
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-using-task-runner.html
Task Runner is a task agent application that polls AWS Data Pipeline
for scheduled tasks and executes them on Amazon EC2 instances, Amazon
EMR clusters, or other computational resources, reporting status as it
does so. Depending on your application, you may choose to:
Allow AWS Data Pipeline to install and manage one or more Task Runner
applications for you on computational resources that it manages
automatically. In this case, you do not need to install or configure
Task Runner as described in this section. This is the recommended
configuration.
Manually install and configure Task Runner on a computational resource
such as a long-running EC2 instance or a physical server. To do so,
use the procedures in this section.
Develop and install a custom task agent instead of Task Runner. The
procedures for doing so will depend on the implementation of the
custom task agent.
Amazon has introducted Lambda last year for NodeJS, yesterday Amazon added the features Scheduled Functions, VPC Support, and Python Support.
By leveraging Scheduled Function - a proper replacement for CRON can be attained.
More Info - http://aws.amazon.com/lambda/details/
As of August 2020, Amazon has moved the Lambda/CloudWatch events to a service called EventBridge (https://aws.amazon.com/eventbridge/). It was launched in July 2019, after most of the answers to this question.
Looks like this is a relatively new option from AWS BeanStalk:
https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features-managing-env-tiers.html#worker-periodictasks
Basically, they act like regular SQS receivers, but they're called on a cron schedule instead of in response to a SQS message.
SWF is a Web service from AWS that can be used to schedule tasks. Most of the work goes into specifying what a task and a schedule is.
http://milindparikh.blogspot.com/2015/07/introducing-diksha-aws-lambda-function.html is a scalable scheduler written against SWF.
CloudWatch Events are great, but there is a limit on their number. If you need a scale and willing to sacrifice the precision you could use DynamoDB's TTL as a timer.
The idea is to put items into a DynamoDB table with a TTL set to the time you need to run a task. DynamoDB will delete those items somewhere around the specified time (within 48 hours of expiration). Those deleted items will appear in the DynamoDB stream, associated with a table. A lambda function could listen the stream and take appropriate actions upon the deletions.
Read more in "DynamoDB TTL as an ad-hoc scheduling mechanism" by theburningmonk.com.
The AWS Elastic Load Balancers will ping your instances to check that they're healthy. You can add your cron-like tasks to the script that the ELB is pinging, and it will execute very regularly.
You'd want to add some logic so that each tasks is executed the right amount of times and at the right interval, but this could be accomplished with a database table that tracks executions. Each time the ELB pings your server, your server would check the database to see if any job is pending, and then execute that job.
The ELB will timeout if the script takes too long to execute, so it's important to not create a situation where your ELB health check will take many seconds to process the cron tasks. To overcome this, you can employ the AWS Simple Notification Service. Your ELB health check script can simply publish a message to an SNS topic, and then that topic can deliver the message via an HTTP request to your web server.
In other words:
ELB pings your EC2 instance...
EC2 instance checks for pending jobs and sends a message to SNS if any are found...
SNS notifies your app via HTTP...
The HTTP call from SNS is what actually processes the cron job