I have an AWS EMR cluster (emr-4.2.0, Spark 1.5.2) where I am submitting steps from the AWS CLI. My problem is that if the Spark application fails, YARN tries to run the application again (under the same EMR step).
How can I prevent this?
I tried setting --conf spark.yarn.maxAppAttempts=1, which shows up correctly under Environment/Spark Properties, but it doesn't prevent YARN from restarting the application.
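For reference, a step submission that forwards this setting might look like the following (the cluster ID, class name, and jar location are hypothetical placeholders):

    # Submit a Spark step, forwarding the conf to spark-submit.
    # j-XXXXXXXX and the S3 path below are hypothetical.
    aws emr add-steps --cluster-id j-XXXXXXXX --steps \
      Type=Spark,Name="MyApp",ActionOnFailure=CONTINUE,Args=[--conf,spark.yarn.maxAppAttempts=1,--class,com.example.MyApp,s3://my-bucket/my-app.jar]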
Related
I am developing a module to delete the EMR cluster, but before deletion I need to check whether any Spark jobs are running on the cluster. Is there a way to check this using a boto3 API call or through a CLI command? Basically, I don't want any YARN applications in the RUNNING state. In the console this can be found on the 'Application User Interfaces' tab, under High-level application history.
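There is no boto3 call that lists YARN applications directly, but the YARN ResourceManager's REST API (or the yarn CLI on the master node) can answer this. A minimal sketch, where the master hostname is a placeholder:

    # Ask the YARN ResourceManager for applications in the RUNNING state.
    # Port 8088 is YARN's default ResourceManager web port.
    curl -s "http://<master-public-dns>:8088/ws/v1/cluster/apps?states=RUNNING"

    # Equivalent check over SSH on the master node itself:
    yarn application -list -appStates RUNNING

If the response contains no applications, the cluster is idle and safe to delete.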
I am submitting a Spark application job jar to EMR, and it uses a property file. I could put the file into S3 and, while creating the EMR cluster, download it and copy it to some location on the EMR box. If this is the best way, how can I do it at bootstrapping time, while creating the EMR cluster itself?
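(For reference, the bootstrap approach described here would look roughly like the following; the script name and bucket paths are hypothetical:)

    #!/bin/bash
    # bootstrap.sh - copies the property file from S3 onto each node
    # when the cluster starts up.
    aws s3 cp s3://my-bucket/app.properties /home/hadoop/app.properties

registered at creation time with --bootstrap-actions Path=s3://my-bucket/bootstrap.sh.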
Under 'Edit software settings' you can add your own configuration or a JSON file (stored in an S3 location), and with this setting you can pass configuration parameters to the EMR cluster at creation time. For more details, please check the following links:
Amazon EMR Cluster Configurations
Configuring Applications
AWS CLI
Hope this will help you.
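A minimal sketch of the same thing from the CLI (the classification and property below are purely illustrative):

    [
      {
        "Classification": "spark-defaults",
        "Properties": {
          "spark.executor.memory": "4g"
        }
      }
    ]

Saved as configurations.json, this is passed at cluster creation time (an https:// S3 URL also works in place of the local file):

    aws emr create-cluster --release-label emr-4.2.0 \
      --applications Name=Spark \
      --configurations file://./configurations.json \
      --instance-type m3.xlarge --instance-count 3 \
      --use-default-roles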
When I create an AWS EMR cluster I can do so through the simple wizard in the AWS Management Console. Once it's completed I can test it out, and when I'm happy with its configuration I can simply click the AWS CLI Export button and copy the CLI command that creates the EMR cluster.
I need to create an EMR cluster as part of my AWS Data Pipeline process, and rather than configuring the EmrCluster and then running whatever EmrActivity I want, I'm wondering if I could just copy the CLI command I exported during my testing and paste it inside a ShellCommandActivity, which would create the cluster. From there I could use either an EmrActivity to do some processing, or just use the ShellCommandActivity to do the processing.
Can I create my AWS Data Pipeline EMR cluster using a CLI command that's run through a ShellCommandActivity? And if I do so, will I be able to run an EmrActivity against that cluster? I just think it would be easier to create the EMR cluster this way, because I can use the AWS Management Console to create and test my cluster before exporting the CLI command, rather than going through the process of properly constructing the cluster through the Data Pipeline wizard/JSON process. That is, the actual EMR wizard in the AWS Management Console is far easier than the Data Pipeline wizard for creating an EMR cluster, especially when it comes to choosing security groups and various configurations.
Update:
I just verified that I can in fact run a CLI command through a ShellCommandActivity to create my EMR cluster through the Data Pipeline, but is this possibly a code smell or bad practice? Are there any downsides to creating an EMR cluster on the Data Pipeline this way rather than doing it through the predefined EmrCluster object?
It's possible, but a little complicated:
The following action, or the script itself, would have to wait for the cluster to be created (see the sketch below). Make sure the action does not time out.
The Data Pipeline does not know about the cluster, so you need to specify a workerGroup instead of runsOn in the EmrActivity. You also need to install Task Runner on the cluster.
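A minimal sketch of the waiting part, assuming the ShellCommandActivity runs a script built from the exported CLI command (all options shown are placeholders):

    #!/bin/bash
    # Create the cluster, capture its ID, then block until it is
    # up and running before the pipeline moves on.
    CLUSTER_ID=$(aws emr create-cluster \
      --release-label emr-4.2.0 --applications Name=Spark \
      --instance-type m3.xlarge --instance-count 3 \
      --use-default-roles \
      --query ClusterId --output text)
    aws emr wait cluster-running --cluster-id "$CLUSTER_ID"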
If I want to have a long-running EMR cluster, and I then want to set up a Data Pipeline that does something on that cluster, how can I do it?
Do I have to install Task Runner on this EMR cluster? Or will Task Runner be preinstalled? Or is there some other simple way?
Task Runner does not come preinstalled on EMR. It has to be configured manually; follow these steps to install Task Runner on the EMR cluster.
When starting the Task Runner process, provide a name via --workerGroup. This name is the identifier for the EMR cluster and can be used for the workerGroup field in Data Pipeline activities.
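The startup command looks roughly like this (the credentials file, worker group name, region, and log bucket are placeholders):

    # Start Task Runner on the master node; the --workerGroup value is
    # what Data Pipeline activities reference in their workerGroup field.
    java -jar TaskRunner-1.0.jar \
      --config ~/credentials.json \
      --workerGroup=my-emr-worker-group \
      --region=us-east-1 \
      --logUri=s3://my-bucket/task-runner-logs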
I am running a Spark cluster on AWS EMR. How do I get all the details of the jobs and executors that are running on AWS EMR without using the Spark UI? I am going to use it for monitoring and optimization.
You can check out Nagios or Ganglia for cluster health, but you can't see the jobs running on Spark with these tools.
If you are using AWS EMR, you can do that using the lynx text-based browser, something like below.
Log in to the master node of the cluster.
Try the below command:
lynx http://localhost:4040
Note: before you type the command, make sure a job is running; the Spark UI on port 4040 is only available while an application is active.
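If you want the same details programmatically instead of through a browser, the Spark driver exposes a REST monitoring API on the same port (available since Spark 1.4, so it applies to Spark 1.5.2 on EMR); a minimal sketch from the master node:

    # List active applications via the Spark monitoring REST API.
    curl -s http://localhost:4040/api/v1/applications

    # Executor details for one application (<app-id> comes from the
    # previous call and is a placeholder here).
    curl -s http://localhost:4040/api/v1/applications/<app-id>/executors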