How to add an EMR Spark Step? - amazon-web-services

According to the docs:
For Step type, choose Spark application.
But in Amazon EMR -> Clusters -> mycluster -> Steps -> Add step -> Step type, the only options are:

There are two ways to add EMR Spark steps:
- Using command-runner.jar (the Custom JAR step type), e.g.:
spark-submit --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar 10
The same step can be added with the AWS CLI:
aws emr add-steps --cluster-id j-xxxxxxxx --steps Name="add emr step to run spark",Jar="command-runner.jar",Args=[spark-submit,--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/examples/jars/spark-examples.jar,10]
- Spark application (this step type only appears when Spark is installed on the cluster; see the CLI sketch below)
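If the cluster was created with Spark installed, the same SparkPi example can also be submitted as a Spark-type step from the CLI; a sketch with a placeholder cluster id:
aws emr add-steps --cluster-id j-xxxxxxxx --steps Type=Spark,Name="Spark application step",ActionOnFailure=CONTINUE,Args=[--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/examples/jars/spark-examples.jar,10]
With Type=Spark, the CLI wraps the Args in a spark-submit call for you, so the effective command is the same spark-submit shown above.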

I don't have a Spark Application option because I created a Core Hadoop cluster.
When I created the cluster, under Software configuration, I should have chosen Spark; then I would have had the Spark application option under Step type.
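For reference, the equivalent choice from the CLI is to include Spark in the applications list when creating the cluster; a minimal sketch, with the cluster name, release label, and instance settings as placeholders:
aws emr create-cluster \
  --name "my-spark-cluster" \
  --release-label emr-5.36.0 \
  --applications Name=Hadoop Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles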

You can use command-runner.jar for your use case. For the step type, choose Custom JAR from the options that you have.
You can read more about command-runner.jar in the AWS documentation on command-runner usage.
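command-runner.jar is not limited to spark-submit; as a rough example, any command available on the master node can be run as a step, e.g. listing the HDFS root (placeholder cluster id):
aws emr add-steps --cluster-id j-xxxxxxxx --steps Name="list hdfs root",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[hdfs,dfs,-ls,/]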

Related

aws emr add-steps a spark application

I'd like to add a step as a Spark application using the AWS CLI, but I cannot find a working command. The official AWS doc, https://docs.aws.amazon.com/cli/latest/reference/emr/add-steps.html, lists six examples, none of which is for Spark.
I can configure it through the AWS Console UI and it runs fine, but for efficiency I'd like to be able to do so via the AWS CLI.
The closest that I could come up with is this command:
aws emr add-steps --cluster-id j-cluster-id --steps Type=SPARK,Name='SPARK APP',ActionOnFailure=CONTINUE,Jar=s3://my-test/RandomJava-1.0-SNAPSHOT.jar,MainClass=JavaParquetExample1,Args=s3://my-test/my-file_0000_part_00.parquet,my-test --profile my-test --region us-west-2
but this resulted in the following step configuration on AWS EMR:
JAR location : command-runner.jar
Main class : None
Arguments : spark-submit s3://my-test/my-file_0000_part_00.parquet my-test
Action on failure: Continue
which resulted in failure.
The correct one (completed successfully, configured through AWS Console UI) looks like this:
JAR location : command-runner.jar
Main class : None
Arguments : spark-submit --deploy-mode cluster --class sparkExamples.JavaParquetExample1 s3://my-test/RandomJava-1.0-SNAPSHOT.jar --s3://my-test/my-file_0000_part_00.parquet --my-test
Action on failure: Continue
Any help is greatly appreciated!
This seems to be working for me. I am adding a Spark application to a cluster with the step name My step name. Let's say you name the file step-addition.sh. Its content is the following:
#!/bin/bash
set -x
# Positional arguments: EMR cluster id, plus a start and end date forwarded to the job
clusterId=$1
startDate=$2
endDate=$3
aws emr add-steps --cluster-id "$clusterId" --steps Type=Spark,Name='My step name',\
ActionOnFailure=TERMINATE_CLUSTER,Args=[\
"--deploy-mode","cluster","--executor-cores","1","--num-executors","20","--driver-memory","10g","--executor-memory","3g",\
"--class","your-package-structure-like-com.a.b.c.JavaParquetExample1",\
"--master","yarn",\
"--conf","spark.driver.my.custom.config1=my-value-1",\
"--conf","spark.driver.my.custom.config2=my-value-2",\
"--conf","spark.driver.my.custom.config.startDate=${startDate}",\
"--conf","spark.driver.my.custom.config.endDate=${endDate}",\
"s3://my-bucket/my-prefix/path-to-your-actual-application.jar"]
You can execute the above script simply like this:
bash $WORK_DIR/step-addition.sh $clusterId $startDate $endDate
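Applied to the jar and class from the question, and assuming the parquet path and the literal my-test are meant as application arguments, the Args list would look roughly like this (anything listed after the application jar is passed to its main method):
aws emr add-steps --cluster-id j-cluster-id --steps Type=Spark,Name='SPARK APP',\
ActionOnFailure=CONTINUE,Args=[\
"--deploy-mode","cluster",\
"--class","sparkExamples.JavaParquetExample1",\
"s3://my-test/RandomJava-1.0-SNAPSHOT.jar",\
"s3://my-test/my-file_0000_part_00.parquet","my-test"]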

spark-submit from outside AWS EMR cluster

I have an AWS EMR cluster running spark, and I'd like to submit a PySpark job to it from my laptop (--master yarn) to run in cluster mode.
I know that I need to set up some config on the laptop, but I'd like to know what the bare minimum is. Do I just need some of the config files from the master node of the cluster? If so, which? Or do I need to install hadoop or yarn on my local machine?
I've done a fair bit of searching for an answer, but I haven't yet been able to tell whether what I was reading referred to launching a job from the cluster's master node or from an arbitrary laptop...
If you want to run the spark-submit job solely on your AWS EMR cluster, you do not need to install anything locally. You only need the EC2 key pair you specified in the Security Options when you created the cluster.
I personally scp over any relevant scripts and/or jars, ssh into the master node of the cluster, and then run spark-submit.
You can specify most of the relevant spark job configurations via spark-submit itself. AWS documents in some more detail how to configure spark-submit jobs.
For example:
>> scp -i ~/PATH/TO/${SSH_KEY} /PATH/TO/PYSPARK_SCRIPT.py hadoop@${PUBLIC_MASTER_DNS}:
>> ssh -i ~/PATH/TO/${SSH_KEY} hadoop@${PUBLIC_MASTER_DNS}
>> spark-submit --conf spark.OPTION.OPTION=VALUE PYSPARK_SCRIPT.py
However, if you already pass a particular configuration when creating the cluster itself, you do not need to re-specify those same configuration options via spark-submit.
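For example, Spark defaults can be set once at cluster creation through a configuration classification instead of repeating them on every spark-submit; a sketch, with the file name and property value made up:
cat > spark-config.json <<'EOF'
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.executor.memory": "4g"
    }
  }
]
EOF
aws emr create-cluster --name "my-spark-cluster" --release-label emr-5.36.0 \
  --applications Name=Hadoop Name=Spark --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles --configurations file://./spark-config.json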
You can set up the AWS CLI on your local machine, put your deployment on S3, and then add an EMR step to run on the EMR cluster. Something like this:
aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=SparkWordCountApp,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/],ActionOnFailure=CONTINUE
Source: https://aws.amazon.com/de/blogs/big-data/submitting-user-applications-with-spark-submit/

Can I run a job on EMR like on my local cluster

I have built a local cluster on my laptop (pseudo-distributed mode), where I run various MapReduce commands like
hadoop-streaming -D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-files my_mapper.py,my_reducer.py \
-mapper my_mapper.py \
-reducer my_reducer.py \
-input /aws/input/input_warc.txt \
-output /aws/output
Now I have to run it on EMR. There are two options: the console and the AWS CLI. I want to run exactly the commands above. For that, I think that if I SSH to the EMR master, I should be able to run this command. Is that the right way, or are there any drawbacks to this approach?
Yes, you may SSH to your cluster and run your jobs there, but you may also use the Step API (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-steps.html) to run arbitrary commands on the master instance, including, of course, distributed jobs like your example. You may add steps to a cluster using the AWS CLI ("aws emr add-steps ...", or during cluster creation using "aws emr create-cluster ... --steps ..."), using the AWS SDKs (such as the AWS Java SDK), or using the AWS EMR Console.
Some advantages of the Step API include that it captures the output of each step so that you can view it via the AWS CLI, SDK, or AWS Console, and you can also check the status of Steps to determine when they have completed.
One disadvantage of the Step API is that currently Steps all run sequentially, so you can't have multiple Steps running in parallel.
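As a sketch, the streaming job from the question could be submitted as a step roughly like this (bucket names and paths are placeholders; the JSON file form avoids having to escape the comma inside -files):
cat > streaming-step.json <<'EOF'
[
  {
    "Type": "STREAMING",
    "Name": "My streaming step",
    "ActionOnFailure": "CONTINUE",
    "Args": [
      "-D", "mapred.output.compress=true",
      "-D", "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
      "-files", "s3://mybucket/my_mapper.py,s3://mybucket/my_reducer.py",
      "-mapper", "my_mapper.py",
      "-reducer", "my_reducer.py",
      "-input", "s3://mybucket/input/input_warc.txt",
      "-output", "s3://mybucket/output/"
    ]
  }
]
EOF
aws emr add-steps --cluster-id j-xxxxxxxx --steps file://./streaming-step.json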

Is it possible to get the name of AWS EMR step currently executing without going to console

If there are 3 steps executed on EMR, is it possible to know which step is currently executing, using a shell script?
I am able to get current state of the EMR cluster, using this:
`aws emr describe-cluster --cluster-id ${cluster_id} | python -c 'import json,sys;obj=json.load(sys.stdin);print obj["Cluster"]["Status"]["State"]'`
but I am not able to get the name of the step that is currently executing. Is that possible?
The list-steps subcommand in the AWS CLI for EMR returns each step. It can be filtered to return only the steps in a certain state: aws emr list-steps --step-states "RUNNING"
(See http://docs.aws.amazon.com/cli/latest/reference/emr/list-steps.html)
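Building on that, a rough one-liner that prints just the name of the currently running step (assuming ${cluster_id} is set as in the question; it prints None when no step is running):
aws emr list-steps --cluster-id ${cluster_id} --step-states RUNNING \
  --query 'Steps[0].Name' --output text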

Enable debugging/logging on running EMR cluster

I forgot to add the following two options when creating my EMR cluster with the AWS CLI:
--log-uri s3://mybucket/logs/ --enable-debugging
Is there a way to add a log uri and enable debugging on a running cluster?
You cannot enable debugging on a running cluster.
Logging to S3 also needs to be enabled when you create the cluster.
By default, each cluster writes log files on the master node. These are written to the /mnt/var/log/ directory, and you can access them by connecting to the master node over SSH.
For more details, please check the AWS EMR documentation on logging and debugging.
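In other words, logging and debugging have to be requested when the cluster is created; a sketch with a placeholder bucket, plus a look at the on-node logs over SSH:
aws emr create-cluster --name "my-cluster" --release-label emr-5.36.0 \
  --applications Name=Hadoop Name=Spark --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles --log-uri s3://mybucket/logs/ --enable-debugging
# The default on-node logs mentioned above can be inspected directly on the master node:
ssh -i ~/path/to/key.pem hadoop@<master-public-dns> 'ls /mnt/var/log/'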