How can I issue an hdfs command as a step in an EMR cluster? Adding the step as a script-runner.jar task seems to fail oddly.
Use command-runner.jar and call out to bash to issue your hdfs command.
aws emr add-steps --cluster-id j-XXXXXXXXX --steps Name="Command Runner",Jar="command-runner.jar",Args=[/bin/bash,-c,"\"hdfs dfs -mkdir /tmp/foo\""]
Note that the final argument is passed to bash as a single escaped, quoted string.
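To make the quoting concrete, here is a minimal sketch of a helper that assembles the --steps value in the form the AWS CLI receives it, that is, after your local shell has removed the outer quotes and backslashes. The function name bash_step is hypothetical, and the sketch assumes the command contains no commas or double quotes, which would need extra escaping.

```shell
#!/bin/bash
# Hypothetical helper: build the --steps value for a command-runner.jar step
# that runs one shell command through "bash -c".
# Assumption: the command contains no commas or double quotes.
bash_step() {
  step_name=$1
  step_cmd=$2
  # The inner double quotes keep the whole command as ONE argument to bash -c.
  printf 'Name=%s,Jar=command-runner.jar,Args=[/bin/bash,-c,"%s"]\n' \
    "$step_name" "$step_cmd"
}

# Print the step definition for the mkdir example above.
bash_step "Command Runner" "hdfs dfs -mkdir /tmp/foo"
```

The result could then be passed on as: aws emr add-steps --cluster-id j-XXXXXXXXX --steps "$(bash_step 'Command Runner' 'hdfs dfs -mkdir /tmp/foo')"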
Related
Our ETL scripts contain many HDFS commands. Because of a cloud migration, we need to convert them to AWS S3 commands.
One of the commands is:
hdfs dfs -text ${OUTPUT_PATH}/part* | hdfs dfs -cp ${INPUT_PATH}/${FILENAME}
I need help converting this to an S3 command, mainly the first part with '-text'.
How can I add environment variables to an EMR cluster?
Currently, I put them in a .sh file and use script-runner.jar to run the script.
#!/bin/bash
export PYSPARK_PYTHON=/home/hadoop/bin/python
export PYSPARK_DRIVER_PYTHON=/home/hadoop/bin/python
I was submitting the script like this:
aws emr add-steps \
--cluster-id j-2AXXXXXXGAPLF \
--steps Type=CUSTOM_JAR,Name="Run a script from S3 with script-runner.jar",ActionOnFailure=CONTINUE,Jar=s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://mybucket/my-script.sh]
I have also tried using command-runner.jar. Neither approach worked. Can you suggest another way to add environment variables to the cluster remotely, or from an EC2 instance?
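One possible explanation, offered as an assumption rather than a confirmed diagnosis: a step runs the script in its own process, so the variables it exports disappear when the script exits and are never seen by later steps or by Spark. The sketch below reproduces that behaviour with a plain shell; DEMO_PYTHON is a made-up variable name.

```shell
#!/bin/bash
# A child process exports a variable, much like a script run as a step would.
unset DEMO_PYTHON
sh -c 'export DEMO_PYTHON=/home/hadoop/bin/python; echo "child sees: $DEMO_PYTHON"'

# Back in the parent shell, the variable was never set.
parent_view=${DEMO_PYTHON:-unset}
echo "parent sees: $parent_view"
```

If this is the cause, the values have to be supplied where Spark reads them at launch, for example per job through the Spark properties spark.yarn.appMasterEnv.PYSPARK_PYTHON and spark.executorEnv.PYSPARK_PYTHON passed with --conf.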
I'd like to add a Spark application as a step using the AWS CLI, but I cannot find a working command. The official AWS documentation (https://docs.aws.amazon.com/cli/latest/reference/emr/add-steps.html) lists six examples, and none of them is for Spark.
I can configure the step through the AWS Console UI and it runs fine, but for efficiency I'd like to be able to do the same via the AWS CLI.
The closest that I could come up with is this command:
aws emr add-steps --cluster-id j-cluster-id --steps Type=SPARK,Name='SPARK APP',ActionOnFailure=CONTINUE,Jar=s3://my-test/RandomJava-1.0-SNAPSHOT.jar,MainClass=JavaParquetExample1,Args=s3://my-test/my-file_0000_part_00.parquet,my-test --profile my-test --region us-west-2
but it produced the following step configuration on EMR:
JAR location : command-runner.jar
Main class : None
Arguments : spark-submit s3://my-test/my-file_0000_part_00.parquet my-test
Action on failure: Continue
which resulted in failure.
The correct one (completed successfully, configured through AWS Console UI) looks like this:
JAR location : command-runner.jar
Main class : None
Arguments : spark-submit --deploy-mode cluster --class sparkExamples.JavaParquetExample1 s3://my-test/RandomJava-1.0-SNAPSHOT.jar --s3://my-test/my-file_0000_part_00.parquet --my-test
Action on failure: Continue
Any help is greatly appreciated!
This seems to work for me. It adds a Spark application to a cluster as a step named My step name. Suppose you save the file as step-addition.sh, with the following content:
#!/bin/bash
set -x
# Positional arguments: cluster ID, then the date range passed to the job
clusterId=$1
startDate=$2
endDate=$3
aws emr add-steps --cluster-id "$clusterId" --steps Type=Spark,Name='My step name',\
ActionOnFailure=TERMINATE_CLUSTER,Args=[\
"--deploy-mode","cluster","--executor-cores","1","--num-executors","20","--driver-memory","10g","--executor-memory","3g",\
"--class","your-package-structure-like-com.a.b.c.JavaParquetExample1",\
"--master","yarn",\
"--conf","spark.driver.my.custom.config1=my-value-1",\
"--conf","spark.driver.my.custom.config2=my-value-2",\
"--conf","spark.driver.my.custom.config.startDate=${startDate}",\
"--conf","spark.driver.my.custom.config.endDate=${endDate}",\
"s3://my-bucket/my-prefix/path-to-your-actual-application.jar"]
You can execute the above script simply like this:
bash $WORK_DIR/step-addition.sh $clusterId $startDate $endDate
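As a rough sketch of the same pattern, the helper below joins spark-submit arguments with commas to form the Args=[...] list of a Type=Spark step. The function names join_by_comma and spark_step are hypothetical; the example values are copied from the working console configuration in the question. It assumes that no argument contains a comma.

```shell
#!/bin/bash
# Hypothetical helpers for assembling a Type=Spark step definition.

# Join all arguments into one comma-separated string.
join_by_comma() {
  joined=""
  for arg in "$@"; do
    joined="${joined:+$joined,}$arg"
  done
  printf '%s' "$joined"
}

# $1 is the step name; the remaining arguments are what follows "spark-submit".
spark_step() {
  step_name=$1
  shift
  printf 'Type=Spark,Name=%s,ActionOnFailure=CONTINUE,Args=[%s]\n' \
    "$step_name" "$(join_by_comma "$@")"
}

# Print a step definition mirroring the console configuration.
spark_step "SPARK APP" \
  --deploy-mode cluster \
  --class sparkExamples.JavaParquetExample1 \
  s3://my-test/RandomJava-1.0-SNAPSHOT.jar \
  --s3://my-test/my-file_0000_part_00.parquet \
  --my-test
```

The point of the design is that, for a Spark step, the application JAR and the main class travel inside Args rather than in separate Jar= and MainClass= fields.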
I have an AWS EMR cluster running spark, and I'd like to submit a PySpark job to it from my laptop (--master yarn) to run in cluster mode.
I know that I need to set up some config on the laptop, but I'd like to know what the bare minimum is. Do I just need some of the config files from the master node of the cluster? If so, which? Or do I need to install hadoop or yarn on my local machine?
I've done a fair bit of searching for an answer, but I haven't yet been able to be sure that what I was reading referred to launching a job from the master of the cluster or some arbitrary laptop...
If you want to run the spark-submit job solely on your AWS EMR cluster, you do not need to install anything locally. You only need the EC2 key pair you specified in the Security Options when you created the cluster.
I personally scp over any relevant scripts and/or jars, ssh into the master node of the cluster, and then run spark-submit.
You can specify most of the relevant spark job configurations via spark-submit itself. AWS documents in some more detail how to configure spark-submit jobs.
For example:
>> scp -i ~/PATH/TO/${SSH_KEY} /PATH/TO/PYSPARK_SCRIPT.py hadoop@${PUBLIC_MASTER_DNS}:
>> ssh -i ~/PATH/TO/${SSH_KEY} hadoop@${PUBLIC_MASTER_DNS}
>> spark-submit --conf spark.OPTION.OPTION=VALUE PYSPARK_SCRIPT.py
However, if you already pass a particular configuration when creating the cluster itself, you do not need to re-specify those same configuration options via spark-submit.
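The copy-then-submit workflow can also be wrapped in a small script. This is only a sketch: the names run and submit_pyspark, the DRY_RUN switch, and the example key path and host name are all made up for illustration. It runs spark-submit as a remote ssh command instead of opening an interactive session.

```shell
#!/bin/bash
# Hypothetical wrapper around the scp / ssh / spark-submit sequence.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Usage: submit_pyspark KEY_FILE MASTER_DNS SCRIPT [spark-submit options...]
submit_pyspark() {
  key=$1
  host=$2
  script=$3
  shift 3
  remote_name=$(basename "$script")
  # Copy the script into the hadoop user's home directory on the master node.
  run scp -i "$key" "$script" "hadoop@${host}:" || return 1
  # Launch the job on the master node, passing any extra options through.
  run ssh -i "$key" "hadoop@${host}" spark-submit "$@" "$remote_name"
}

# Dry run with placeholder values: only prints the two commands.
DRY_RUN=1 submit_pyspark /tmp/my-key.pem master.example.com /work/job.py --deploy-mode cluster
```

Dropping DRY_RUN=1 executes the two commands for real, which requires network access to the master node and the cluster's key pair.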
You can set up the AWS CLI on your local machine, put your deployment on S3, and then add an EMR step to run on the EMR cluster. Something like this:
aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=SparkWordCountApp,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/],ActionOnFailure=CONTINUE
Source: https://aws.amazon.com/de/blogs/big-data/submitting-user-applications-with-spark-submit/
If three steps are executed on an EMR cluster, is it possible to find out from a shell script which step is currently executing?
I am able to get the current state of the EMR cluster using this:
`aws emr describe-cluster --cluster-id ${cluster_id} | python -c 'import json,sys;obj=json.load(sys.stdin);print(obj["Cluster"]["Status"]["State"])'`
but I am not able to get the name of the step that is currently executing. Is that possible?
The list-steps command in the AWS CLI for EMR returns each step of a cluster. It can be filtered to return only the steps in a given state: aws emr list-steps --cluster-id ${cluster_id} --step-states "RUNNING"
(See http://docs.aws.amazon.com/cli/latest/reference/emr/list-steps.html)
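To get just the names without parsing JSON by hand, the CLI's global --query and --output options can be combined with list-steps. A minimal sketch follows; the function names running_step_names and cluster_state are hypothetical, and both assume the AWS CLI is installed and configured.

```shell
#!/bin/bash
# Hypothetical helper: print the names of the steps currently running.
# $1 is the cluster ID.
running_step_names() {
  aws emr list-steps \
    --cluster-id "$1" \
    --step-states RUNNING \
    --query 'Steps[].Name' \
    --output text
}

# Hypothetical helper: print the cluster state using the same technique,
# as an alternative to piping describe-cluster through Python.
cluster_state() {
  aws emr describe-cluster \
    --cluster-id "$1" \
    --query 'Cluster.Status.State' \
    --output text
}
```

In a script this could be called as running_step_names "${cluster_id}".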