When running a step (for example loading data) on my AWS EMR cluster via the terminal, is it possible to automatically return a message in my terminal when the step has finished? Instead of having to check it myself every several minutes?
AFAIK, you can only wait for the EMR cluster to terminate using the AWS CLI. If you need the status of each task, I think you need to write something custom. CloudWatch may be an option as well, since EMR metrics are sent to CloudWatch by default. More details here. Hope this helps.
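For what it's worth, boto3 also ships a `step_complete` waiter for EMR (and the CLI has a matching `aws emr wait step-complete`), so the "something custom" can be quite small. A minimal sketch, assuming you already know the cluster ID and step ID; the function name and the returned message format are made up for illustration:

```python
def wait_for_step(emr_client, cluster_id, step_id, delay=30, max_attempts=120):
    """Block until the step completes, then return a short status message."""
    waiter = emr_client.get_waiter("step_complete")
    try:
        waiter.wait(
            ClusterId=cluster_id,
            StepId=step_id,
            # Poll every `delay` seconds, give up after `max_attempts` polls.
            WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
        )
    except Exception as exc:  # botocore raises WaiterError on failure or timeout
        return f"Step {step_id} did not complete: {exc}"
    return f"Step {step_id} finished"

# Usage (needs boto3 and AWS credentials; the IDs are placeholders):
#   import boto3
#   print(wait_for_step(boto3.client("emr"), "j-XXXXXXXXXXXXX", "s-XXXXXXXXXXXXX"))
```

Running this in the terminal right after submitting the step prints the message as soon as the waiter returns.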
I'm trying to run Terraform through CodePipeline. I need to manage a fleet of clusters, and CodePipeline seems to be one of the good ways to trigger certain pipelines on some conditions.
I have a very simple requirement: I want to see the Terraform execution in real time. I want to expose the CodePipeline run in a way that lets me stream it. Is this where EventBridge is used? I looked at an EventBridge example here - https://medium.com/hackernoon/monitoring-ci-cd-pipelines-with-amazon-eventbridge-32177e2f2c3e - but it doesn't seem to stream the run output in real time.
Which event or hook should I attach to? And is CodePipeline even the right thing to use here?
Which event or hook should I attach to?
You're looking at the wrong AWS service. EventBridge is not for streaming log output. It is for discrete events, not a stream.
Your CodePipeline would be using a CodeBuild task to execute Terraform. Your CodeBuild task will be configured to log to AWS CloudWatch Logs. You can view the CloudWatch Logs output in the AWS CloudWatch web console, with the option to poll for new log output.
You can also do the same in a command line console with the aws logs tail command, documented here.
To do the same thing in your own code, you would have to write your code to poll the CloudWatch Logs API in a loop.
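A rough sketch of such a polling loop with boto3's `filter_log_events`. The function names, the log group name in the usage comment, and the five-second interval are assumptions for illustration. Advancing the start time by one millisecond is a simplification and can skip events that share a timestamp with the last one seen:

```python
import time

def fetch_new_events(logs_client, log_group, start_ms):
    """Return (messages, next_start_ms) for events at or after start_ms."""
    messages = []
    next_start = start_ms
    kwargs = {"logGroupName": log_group, "startTime": start_ms}
    while True:
        page = logs_client.filter_log_events(**kwargs)
        for event in page.get("events", []):
            messages.append(event["message"])
            next_start = max(next_start, event["timestamp"] + 1)
        token = page.get("nextToken")
        if not token:
            return messages, next_start
        kwargs["nextToken"] = token  # keep paging through this poll's results

def tail(logs_client, log_group, start_ms, interval=5, rounds=None):
    """Poll in a loop and print new lines; rounds=None polls forever."""
    done = 0
    while rounds is None or done < rounds:
        messages, start_ms = fetch_new_events(logs_client, log_group, start_ms)
        for line in messages:
            print(line.rstrip())
        done += 1
        time.sleep(interval)
    return start_ms

# Usage (needs boto3 and AWS credentials; the log group name is hypothetical):
#   import boto3
#   tail(boto3.client("logs"), "/aws/codebuild/my-terraform-project",
#        int(time.time() * 1000))
```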
And is CodePipeline even the right thing to use here?
Yes, absolutely.
In my architecture when I receive a new file on S3 bucket, a lambda function triggers an ECS task.
The problem occurs when I receive multiple files at the same time: the Lambda triggers multiple instances of the same ECS task, which all act on the same shared resources.
I want to ensure that only one instance of a specific ECS task is running. How can I do that?
Is there a specific setting that can ensure it?
I tried to query the ECS cluster before running a new instance of the ECS task, but (using the AWS Python SDK) I didn't receive any information while the task was in PROVISIONING status; the SDK only returns data when the task is in PENDING or RUNNING.
Thank you
I don't think you can control that, because each S3 event will trigger a new task. Checking whether the task is already running is harder, and you might miss executions if you receive a lot of files.
You should approach this differently. If you want only one task doing the processing, forget about triggering the ECS task from the S3 event. It might work better if you implement a queue: your S3 event should add the information (via Lambda, maybe?) to an SQS queue.
From there you can have an ECS service that long-polls SQS and processes one message at a time.
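A minimal sketch of the worker loop that such an ECS service could run, assuming boto3 and a handler function of your own; `process_queue`, `handler`, and `max_polls` are hypothetical names. `receive_message` with `WaitTimeSeconds` is what makes it long polling, and `MaxNumberOfMessages=1` keeps it to one file at a time:

```python
def process_queue(sqs_client, queue_url, handler, max_polls=None):
    """Long-poll the queue and handle one message at a time."""
    polls = 0
    handled = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        response = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=1,  # never take more than one file at once
            WaitTimeSeconds=20,     # long polling (20 s is the SQS maximum)
        )
        for message in response.get("Messages", []):
            handler(message["Body"])
            # Delete only after successful processing so failures are retried.
            sqs_client.delete_message(
                QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
            )
            handled += 1
    return handled

# Usage (needs boto3 and AWS credentials; the queue URL is a placeholder):
#   import boto3
#   process_queue(boto3.client("sqs"), "https://sqs.../my-queue", print)
```

Presumably you would also keep the service's desired count at 1, so that only one copy of this loop is ever running.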
I have an AWS Step Function with many state transitions that can run for a half hour or more.
There are only a few states, and the application loops through them until it runs out of items to process.
I have a run that failed after about half an hour. I can look at the logging under the "Execution event history". However, since this logs every transition and state, there are thousands of events. I cannot page down to show enough events (clicking the "Load More" button) without hanging my browser window.
There is no way to sort or filter this list that I can see.
How can I find the cause of the failure? Is there a way to export the Execution event history somewhere? Or send it to CloudWatch?
You can use the AWS CLI command aws stepfunctions get-execution-history with the --reverse-order flag to get the events most recent first, which is where the errors will be.
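The same call is available in boto3 as `get_execution_history` with `reverseOrder=True`. A small sketch that pages through the history newest-first and stops at the first failure-like event; matching event types by suffix is my own heuristic, and the function name and `max_pages` limit are assumptions:

```python
FAILURE_SUFFIXES = ("Failed", "TimedOut", "Aborted")

def find_latest_failure(sfn_client, execution_arn, max_pages=10):
    """Scan the history newest-first and return the first failure event found."""
    kwargs = {"executionArn": execution_arn, "reverseOrder": True, "maxResults": 100}
    for _ in range(max_pages):
        page = sfn_client.get_execution_history(**kwargs)
        for event in page.get("events", []):
            # e.g. ExecutionFailed, TaskFailed, LambdaFunctionFailed, TaskTimedOut
            if event["type"].endswith(FAILURE_SUFFIXES):
                return event
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token
    return None

# Usage (needs boto3 and AWS credentials; the ARN is a placeholder):
#   import boto3
#   print(find_latest_failure(boto3.client("stepfunctions"), "arn:aws:states:..."))
```

The returned event carries a details field for its type, which should contain the error and cause.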
How do you process your steps? Docker containers on ECS or Fargate? Give us some details on that.
Your tasks should be sending out logs to CloudWatch as they execute.
You can also look at the Docker logs themselves on the host, if you run Docker on a machine you can SSH into.
My case is the following. I want to launch a cluster during working hours and terminate it after 18:00 and on weekends. The clusters will be used for a data science project. Years ago we would have used a boring crontab for this, but these days I prefer to do it with a Lambda function.
In boto3 I can launch a cluster (thanks to Jose Quinteiro), and this post describes it very well: How to launch and configure an EMR cluster using boto
How can I terminate a cluster in boto3 in the same Lambda function that starts it?
Using an AWS CloudWatch event/rule and an AWS Lambda function to check for idle EMR clusters, you can achieve your goal. You get visibility at the AWS Console level and can easily enable and disable it.
Keeping in mind the need for this, I have developed a small framework that uses this approach. It is an AWS-based solution using AWS CloudWatch and AWS Lambda, with a Python script that uses Boto3 to terminate AWS EMR clusters that have been idle for a specified period of time.
You specify the maximum idle time threshold, and an AWS CloudWatch event/rule triggers an AWS Lambda function that queries all AWS EMR clusters in the WAITING state. For each cluster, it compares the current time with the cluster's ready time if no EMR steps have been added so far, or otherwise with the end time of the cluster's last step. If the threshold has been exceeded, the cluster is terminated after removing termination protection, if enabled. If not, the cluster is skipped.
The AWS CloudWatch event/rule decides how often the AWS Lambda function checks for idle AWS EMR clusters.
You can disable the AWS CloudWatch event/rule at any time to disable this framework in a single click without deleting its AWS CloudFormation stack.
The AWS Lambda function uses Python 3.7 as its runtime environment.
In your case, while creating the stack, you can specify your required Cron expression and maximum idle EMR cluster threshold in minutes to achieve this.
You can get the code and use it from GitHub here: https://github.com/abdullahkhawer/auto-terminate-idle-emr
Any contributions, improvements and suggestions to this solution will be highly appreciated. :)
You can terminate the cluster with boto3 using:
import boto3
emr_client = boto3.client('emr')
# Replace the placeholder with the ID of the cluster you want to terminate.
emr_client.terminate_job_flows(JobFlowIds=['j-XXXXXXXXXXXXX'])
You could create a scheduled event in CloudWatch that triggers the Lambda you are using.
Scheduled events use cron expressions, so you will be able to apply the same logic. Once your function is triggered, you will need to determine from the event input whether it is a shutdown trigger.
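One way to tell the two triggers apart, sketched under the assumption that each scheduled rule passes a constant JSON input such as `{"action": "start"}` or `{"action": "stop"}`. The `action` key, the function name, and the `start_cluster` callback are made up for illustration. For simplicity the sketch ignores pagination in `list_clusters`:

```python
ACTIVE_STATES = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]

def handle_schedule(event, emr_client, start_cluster, cluster_name):
    """Dispatch on the 'action' field that the scheduled rule passes as input."""
    action = event.get("action")
    if action == "start":
        # start_cluster is your existing launch code (e.g. a run_job_flow call).
        return {"started": start_cluster()}
    if action == "stop":
        clusters = emr_client.list_clusters(ClusterStates=ACTIVE_STATES)["Clusters"]
        ids = [c["Id"] for c in clusters if c["Name"] == cluster_name]
        if ids:
            emr_client.terminate_job_flows(JobFlowIds=ids)
        return {"terminated": ids}
    raise ValueError(f"Unknown action: {action!r}")

# In the Lambda entry point you would call something like:
#   handle_schedule(event, boto3.client("emr"), launch_my_cluster, "datascience")
```

With one rule firing on weekday mornings with the start input and another at 18:00 with the stop input, the same function covers both cases.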
I currently have a task at hand to terminate a long-running EMR cluster after a set period of time (based on some metric). Google Dataproc has this capability in something called "Cluster Scheduled Deletion", listed here: Cluster Scheduled Deletion
Is this something that is possible on EMR natively? Maybe using CloudWatch metrics? Or can I write a long-running jar that sits on the EMR master node, polls YARN for some idle-time metric, and then shuts down the cluster after a set period of time?
Edit: For more clarification, I would like functionality where the cluster is terminated after being idle for some x amount of time. For example, if the cluster has been up for a while but no jobs have been run for, say, 1 hour and the cluster is just sitting there doing nothing, then I'd like the ability to terminate the cluster.
The easiest method would be to use Amazon EMR Metrics and Dimensions for Amazon CloudWatch. There is an IsIdle boolean that "indicates that a cluster is no longer performing work".
You could create a CloudWatch Alarm that triggers if the metric is true for more than x minutes. This would send a message to Amazon SNS, which can trigger a Lambda function to shut down the cluster.
Components:
Amazon CloudWatch Alarm
Amazon SNS topic
AWS Lambda function
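As a sketch of the alarm part, the function below builds the arguments for boto3's `put_metric_alarm`. It assumes the metric is published as `IsIdle` in the `AWS/ElasticMapReduce` namespace with a `JobFlowId` dimension at five-minute intervals. The alarm name, function name, and topic ARN are placeholders:

```python
def idle_alarm_params(cluster_id, topic_arn, idle_minutes=60):
    """Build put_metric_alarm arguments for an EMR IsIdle alarm."""
    period = 300  # seconds; EMR metrics are assumed to arrive every five minutes
    return {
        "AlarmName": f"emr-idle-{cluster_id}",
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": "IsIdle",
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        # Minimum == 1 means the cluster was idle for the whole period.
        "Statistic": "Minimum",
        "Period": period,
        "EvaluationPeriods": max(1, idle_minutes // (period // 60)),
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [topic_arn],  # SNS topic that triggers the Lambda
    }

# Usage (needs boto3 and AWS credentials; both IDs are placeholders):
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(
#       **idle_alarm_params("j-XXXXXXXXXXXXX", "arn:aws:sns:..."))
```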
Update: This apparently isn't suitable (see comments below).
An alternate method would be:
Use Amazon CloudWatch Events to schedule a Lambda function every x seconds
The Lambda function looks for any clusters with a particular tag that indicates how long to wait until shutdown (eg 40 minutes). If the tag is not present, the cluster remains untouched.
The Lambda function queries the cluster state (somehow -- probably via a Hadoop API call), then:
If the cluster is idle and there is no Idle Since tag, add an Idle Since tag with the current timestamp
If the cluster is idle and it has been more than x minutes since the timestamp in the Idle Since tag, terminate the cluster.
If the cluster is not idle, remove the Idle Since tag (if present)
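The three rules above boil down to a small decision function. This is only a sketch of that logic; how you detect idleness and read or write the tag is left out, and the function name and return values are made up:

```python
from datetime import datetime, timedelta, timezone

def next_action(is_idle, idle_since, wait_minutes, now):
    """Return 'tag', 'terminate', 'untag' or 'none' for one cluster."""
    if not is_idle:
        # Busy again: clear a stale Idle Since tag if there is one.
        return "untag" if idle_since is not None else "none"
    if idle_since is None:
        return "tag"  # first time we see it idle: record the timestamp
    if now - idle_since > timedelta(minutes=wait_minutes):
        return "terminate"
    return "none"

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(next_action(True, now - timedelta(minutes=45), 40, now))  # terminate
```

Here `idle_since` is the parsed value of the Idle Since tag (or `None` if the tag is absent) and `wait_minutes` comes from the shutdown tag.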
Keeping in mind the clarification that you have provided in your question, there could be 3 possible ways to do that.
1) Using the AWS CloudWatch metric IsIdle of an EMR cluster. This metric tracks whether a cluster is live but not currently running tasks. You can set an alarm to fire when the cluster has been idle for a given period of time, such as thirty minutes.
Reference: https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html
2) [Recommended] Using AWS CloudWatch event/rule and AWS Lambda function to check for Idle EMR clusters. You can achieve visibility on the AWS Console level and can easily enable and disable it.
[Recommended] Solution using 2nd Approach
Keeping in mind the need for this, I have developed a small framework that uses the 2nd solution mentioned above. It is an AWS-based solution using AWS CloudWatch and AWS Lambda, with a Python script that uses Boto3 to terminate AWS EMR clusters that have been idle for a specified period of time.
You specify the maximum idle time threshold, and an AWS CloudWatch event/rule triggers an AWS Lambda function that queries all AWS EMR clusters in the WAITING state. For each cluster, it compares the current time with the cluster's ready time if no EMR steps have been added so far, or otherwise with the end time of the cluster's last step. If the threshold has been exceeded, the cluster is terminated after removing termination protection, if enabled. If not, the cluster is skipped.
The AWS CloudWatch event/rule decides how often the AWS Lambda function checks for idle AWS EMR clusters.
You can disable the AWS CloudWatch event/rule at any time to disable this framework in a single click without deleting its AWS CloudFormation stack.
The AWS Lambda function uses Python 3.7 as its runtime environment.
You can get the code and use it from GitHub here: https://github.com/abdullahkhawer/auto-terminate-idle-emr
Note: Any contributions, improvements, and suggestions to this solution that I developed will be highly appreciated.
3) Some other custom solution based on a shell script run by a cron job on the EMR cluster's master node, but you will lose visibility at the AWS Console level and you may need SSH access as well.
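To give an idea of what the Lambda in the 2nd approach has to do, here is a rough boto3 sketch of the idle check described above. It is not the code from the linked repository. The function names are mine, and pagination of `list_clusters` is ignored for brevity:

```python
from datetime import datetime, timedelta, timezone

def idle_since(emr_client, cluster_id):
    """Last step's end time, or the cluster's ready time if it has no steps."""
    steps = emr_client.list_steps(ClusterId=cluster_id)["Steps"]
    if steps:  # list_steps returns the most recent step first
        return steps[0]["Status"]["Timeline"].get("EndDateTime")
    cluster = emr_client.describe_cluster(ClusterId=cluster_id)["Cluster"]
    return cluster["Status"]["Timeline"].get("ReadyDateTime")

def terminate_idle_clusters(emr_client, max_idle_minutes, now=None):
    """Terminate WAITING clusters idle for longer than max_idle_minutes."""
    now = now or datetime.now(timezone.utc)
    terminated = []
    for cluster in emr_client.list_clusters(ClusterStates=["WAITING"])["Clusters"]:
        since = idle_since(emr_client, cluster["Id"])
        if since is None or now - since <= timedelta(minutes=max_idle_minutes):
            continue  # threshold not exceeded (or no timestamp): skip
        # Termination protection must be off before the cluster can be terminated.
        emr_client.set_termination_protection(
            JobFlowIds=[cluster["Id"]], TerminationProtected=False
        )
        emr_client.terminate_job_flows(JobFlowIds=[cluster["Id"]])
        terminated.append(cluster["Id"])
    return terminated

# Usage inside a Lambda handler (needs boto3 and suitable IAM permissions):
#   import boto3
#   terminate_idle_clusters(boto3.client("emr"), max_idle_minutes=60)
```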
I had to do a similar implementation, and just considering the cluster's elapsed time did not solve our problem.
So we came up with an approach that hits the Hadoop API, which is documented here:
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Scheduler_API
So here is what we did:
1) Ask the user who brings up a cluster to add a tag like "AutoShutDown":"True:BufferMinutes", where "AutoShutDown" is the key and "True:BufferMinutes" is the value of the tag. Here BufferMinutes is the time in minutes (30, 60, etc.).
2) Create a Lambda that hits the Hadoop API of every cluster tagged as in step 1 (if the user does not add the tag, the cluster is left untouched) and fetches the end time of the last completed job, but only if all jobs are either completed or terminated. If any job is still running, do nothing and exit. Otherwise:
from datetime import datetime, timedelta, timezone

def shut_down_if_idle(last_finished, buffer_minutes, terminate_cluster):
    # last_finished must be a timezone-aware datetime (end of the last job).
    datetime_difference = datetime.now(timezone.utc) - last_finished
    if datetime_difference > timedelta(minutes=buffer_minutes):
        terminate_cluster()
3) Create a CloudWatch trigger, add the Lambda as its target, and schedule the trigger to run as required.
Note: The Lambda is written in Python, so boto3 is used and the client is "emr", the same as in abdullahkhawer's solution above.
This implementation gives flexibility to the user to choose and reduces a great deal of burden on dev-ops.
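For illustration, the tag parsing and the "last finished job" check could look roughly like this. It assumes the Cluster Applications endpoint (`/ws/v1/cluster/apps`) from the same REST documentation page, whose response lists each application with a `state` and a `finishedTime` in milliseconds. The function names are made up, and this is a sketch, not our actual Lambda:

```python
from datetime import datetime, timezone

ACTIVE_STATES = {"NEW", "NEW_SAVING", "SUBMITTED", "ACCEPTED", "RUNNING"}

def parse_auto_shutdown(tag_value):
    """Parse 'True:30' into (True, 30); return (False, None) if not enabled."""
    flag, _, minutes = tag_value.partition(":")
    if flag.strip().lower() != "true" or not minutes.strip().isdigit():
        return False, None
    return True, int(minutes)

def last_finished(apps_payload):
    """Latest finish time across all apps, or None if any app is still active."""
    apps = (apps_payload.get("apps") or {}).get("app", [])
    if not apps or any(app["state"] in ACTIVE_STATES for app in apps):
        return None
    latest_ms = max(app["finishedTime"] for app in apps)
    return datetime.fromtimestamp(latest_ms / 1000, tz=timezone.utc)

# Fetching the payload from the master node (assumes the default
# ResourceManager port 8088 is reachable from the Lambda):
#   import json, urllib.request
#   url = f"http://{master_dns}:8088/ws/v1/cluster/apps"
#   payload = json.load(urllib.request.urlopen(url))
```

A cluster with no applications at all returns `None` here, so you would fall back to some other timestamp (or skip the cluster) in that case.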