How to add environment variables to an EMR cluster

How can I add environment variables to an EMR cluster?
Currently, I have them in a .sh file and use script-runner.jar to run the script:
#!/bin/bash
export PYSPARK_PYTHON=/home/hadoop/bin/python
export PYSPARK_DRIVER_PYTHON=/home/hadoop/bin/python
I submitted the script as a step, as mentioned here:
aws emr add-steps \
--cluster-id j-2AXXXXXXGAPLF \
--steps Type=CUSTOM_JAR,Name="Run a script from S3 with script-runner.jar",ActionOnFailure=CONTINUE,Jar=s3://us-west-2.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://mybucket/my-script.sh]
I have also tried using command-runner.jar. Neither approach worked. Can you suggest another way to add environment variables to the cluster remotely, for example from an EC2 instance?
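One likely reason the script has no effect: `export` inside a script run as a step only changes the environment of that one shell process, which exits when the step finishes. A commonly used alternative is to pass the variables through EMR's `spark-env` configuration classification when the cluster is created. The sketch below only builds that configuration as JSON; the helper name `spark_env_configuration` is an assumption made for illustration, and it assumes the variables are meant for Spark.

```python
import json


def spark_env_configuration(env_vars):
    """Build an EMR configuration list that exports env_vars via spark-env."""
    return [
        {
            "Classification": "spark-env",
            "Configurations": [
                # The nested "export" classification holds the variables.
                {"Classification": "export", "Properties": dict(env_vars)}
            ],
        }
    ]


config = spark_env_configuration({
    "PYSPARK_PYTHON": "/home/hadoop/bin/python",
    "PYSPARK_DRIVER_PYTHON": "/home/hadoop/bin/python",
})
print(json.dumps(config, indent=2))
```

If the printed JSON is saved to a file (say spark-env.json, a hypothetical name), it can be supplied at creation time with aws emr create-cluster ... --configurations file://spark-env.json.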

Related

aws emr add-steps a spark application

I'd like to add a step as a Spark application using the AWS CLI, but I cannot find a working command. The official AWS doc (https://docs.aws.amazon.com/cli/latest/reference/emr/add-steps.html) lists 6 examples, and none of them is for Spark.
I can configure the step through the AWS Console UI and it runs fine, but for efficiency I'd like to be able to do it via the AWS CLI.
The closest I could come up with is this command:
aws emr add-steps --cluster-id j-cluster-id --steps Type=SPARK,Name='SPARK APP',ActionOnFailure=CONTINUE,Jar=s3://my-test/RandomJava-1.0-SNAPSHOT.jar,MainClass=JavaParquetExample1,Args=s3://my-test/my-file_0000_part_00.parquet,my-test --profile my-test --region us-west-2
but this resulted in this configuration on AWS EMR step:
JAR location : command-runner.jar
Main class : None
Arguments : spark-submit s3://my-test/my-file_0000_part_00.parquet my-test
Action on failure: Continue
which resulted in failure.
The correct one (completed successfully, configured through AWS Console UI) looks like this:
JAR location : command-runner.jar
Main class : None
Arguments : spark-submit --deploy-mode cluster --class sparkExamples.JavaParquetExample1 s3://my-test/RandomJava-1.0-SNAPSHOT.jar --s3://my-test/my-file_0000_part_00.parquet --my-test
Action on failure: Continue
Any help is greatly appreciated!
This seems to be working for me. I am adding a Spark application to a cluster with the step name "My step name". Let's say you name the file step-addition.sh. Its content is the following:
#!/bin/bash
set -x
#cluster id
clusterId=$1
startDate=$2
endDate=$3
aws emr add-steps --cluster-id $clusterId --steps Type=Spark,Name='My step name',\
ActionOnFailure=TERMINATE_CLUSTER,Args=[\
"--deploy-mode","cluster","--executor-cores","1","--num-executors","20","--driver-memory","10g","--executor-memory","3g",\
"--class","your-package-structure-like-com.a.b.c.JavaParquetExample1",\
"--master","yarn",\
"--conf","spark.driver.my.custom.config1=my-value-1",\
"--conf","spark.driver.my.custom.config2=my-value-2",\
"--conf","spark.driver.my.custom.config.startDate=${startDate}",\
"--conf","spark.driver.my.custom.config.endDate=${endDate}",\
"s3://my-bucket/my-prefix/path-to-your-actual-application.jar"]
You can execute the above script simply like this:
bash $WORK_DIR/step-addition.sh $clusterId $startDate $endDate
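For comparison, here is a rough Python sketch of the same idea, expressed as the structure the Console showed in the question (command-runner.jar plus a spark-submit argument list). It assumes you would submit the step with boto3; the helper name `spark_step` and the example values are made up for illustration, and the code only builds the dictionary without calling AWS.

```python
def spark_step(name, jar, main_class, app_args, action_on_failure="CONTINUE"):
    """Return a step definition in the shape accepted by boto3's
    EMR client method add_job_flow_steps(JobFlowId=..., Steps=[...])."""
    spark_submit = [
        "spark-submit",
        "--deploy-mode", "cluster",
        "--class", main_class,
        jar,
    ] + list(app_args)  # application arguments go after the jar
    return {
        "Name": name,
        "ActionOnFailure": action_on_failure,
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": spark_submit},
    }


step = spark_step(
    name="SPARK APP",
    jar="s3://my-test/RandomJava-1.0-SNAPSHOT.jar",
    main_class="sparkExamples.JavaParquetExample1",
    app_args=["s3://my-test/my-file_0000_part_00.parquet", "my-test"],
)
print(step["HadoopJarStep"]["Args"])

# To submit (needs boto3 and AWS credentials, not run here):
#   import boto3
#   boto3.client("emr").add_job_flow_steps(JobFlowId="j-cluster-id", Steps=[step])
```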

What is the equivalent to eb setenv in aws cli?

What is the equivalent command of eb setenv in AWS cli?
I tried option_settings, but it seems to accept only predefined namespaces, not arbitrary variables.
Note: I do not want to set them in the AWS web interface or in .ebextensions config files.
Use the aws elasticbeanstalk update-environment command.
From the examples on this page from the AWS CLI docs:
To set an environment variable
The following command sets the value of the "PARAM1" variable in the
"my-env" environment to "ParamValue":
aws elasticbeanstalk update-environment --environment-name my-env \
--option-settings Namespace=aws:elasticbeanstalk:application:environment,OptionName=PARAM1,Value=ParamValue
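If you have several variables to set at once, a small helper can generate the command. This is only a sketch: the function name `setenv_command` is made up, it just builds the argument list and prints it, and it assumes the values contain no commas (which would clash with the CLI shorthand syntax).

```python
import shlex

NAMESPACE = "aws:elasticbeanstalk:application:environment"


def setenv_command(environment_name, env_vars):
    """Build an `aws elasticbeanstalk update-environment` argument list
    that sets each key/value in env_vars as an environment property."""
    settings = [
        f"Namespace={NAMESPACE},OptionName={key},Value={value}"
        for key, value in env_vars.items()
    ]
    return [
        "aws", "elasticbeanstalk", "update-environment",
        "--environment-name", environment_name,
        "--option-settings", *settings,
    ]


cmd = setenv_command("my-env", {"PARAM1": "ParamValue", "PARAM2": "Other"})
print(shlex.join(cmd))  # shell-quoted command line, ready to paste
```

The list form can also be passed directly to subprocess.run if you prefer not to go through a shell.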

spark-submit from outside AWS EMR cluster

I have an AWS EMR cluster running spark, and I'd like to submit a PySpark job to it from my laptop (--master yarn) to run in cluster mode.
I know that I need to set up some config on the laptop, but I'd like to know what the bare minimum is. Do I just need some of the config files from the master node of the cluster? If so, which? Or do I need to install hadoop or yarn on my local machine?
I've done a fair bit of searching for an answer, but I haven't yet been able to be sure that what I was reading referred to launching a job from the master of the cluster or some arbitrary laptop...
If you want to run the spark-submit job solely on your AWS EMR cluster, you do not need to install anything locally. You only need the EC2 key pair you specified in the Security Options when you created the cluster.
I personally scp over any relevant scripts and/or jars, ssh into the master node of the cluster, and then run spark-submit.
You can specify most of the relevant spark job configurations via spark-submit itself. AWS documents in some more detail how to configure spark-submit jobs.
For example:
>> scp -i ~/PATH/TO/${SSH_KEY} /PATH/TO/PYSPARK_SCRIPT.py hadoop@${PUBLIC_MASTER_DNS}:
>> ssh -i ~/PATH/TO/${SSH_KEY} hadoop@${PUBLIC_MASTER_DNS}
>> spark-submit --conf spark.OPTION.OPTION=VALUE PYSPARK_SCRIPT.py
However, if you already pass a particular configuration when creating the cluster itself, you do not need to re-specify those same configuration options via spark-submit.
You can set up the AWS CLI on your local machine, put your deployment on S3, and then add an EMR step to run on the EMR cluster. Something like this:
aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=SparkWordCountApp,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/],ActionOnFailure=CONTINUE
Source: https://aws.amazon.com/de/blogs/big-data/submitting-user-applications-with-spark-submit/

Can I run a job on EMR like on my local cluster

I have built a local cluster on my laptop (pseudo-distributed mode), where I run different MapReduce commands like
hadoop-streaming -D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-files my_mapper.py,my_reducer.py \
-mapper my_mapper.py \
-reducer my_reducer.py \
-input /aws/input/input_warc.txt \
-output /aws/output
Now I have to run it on EMR. There are two options: the console and the AWS CLI. I want to run exactly the same commands as above. I think that if I SSH to the EMR master, I should be able to run this command. Is this the right way, or is there any drawback to this approach?
Yes, you can SSH to your cluster and run your jobs there, but you can also use the Step API (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-steps.html) to run arbitrary commands on the master instance, including distributed jobs like your example. You can add Steps to a cluster with the AWS CLI ("aws emr add-steps ...", or during cluster creation with "aws emr create-cluster ... --steps ..."), with the AWS SDKs (like the AWS Java SDK), or from the AWS EMR Console.
Some advantages of the Step API include that it captures the output of each step so that you can view it via the AWS CLI, SDK, or AWS Console, and you can also check the status of Steps to determine when they have completed.
One disadvantage of the Step API is that currently Steps all run sequentially, so you can't have multiple Steps running in parallel.
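To illustrate the status-checking point, here is a minimal polling sketch in Python. It assumes a boto3 EMR client is passed in (its describe_step call returns the step state); the function name `wait_for_step`, the polling interval, and the IDs in the usage comment are placeholders.

```python
import time

# States after which a step will not change any more.
TERMINAL_STATES = {"COMPLETED", "CANCELLED", "FAILED", "INTERRUPTED"}


def wait_for_step(emr_client, cluster_id, step_id, poll_seconds=30, sleep=time.sleep):
    """Poll describe_step until the step reaches a terminal state; return it."""
    while True:
        response = emr_client.describe_step(ClusterId=cluster_id, StepId=step_id)
        state = response["Step"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)  # still PENDING or RUNNING, so wait and retry


# Usage (needs boto3 and AWS credentials, not run here):
#   import boto3
#   state = wait_for_step(boto3.client("emr"), "j-XXXXXXXX", "s-XXXXXXXX")
```

The sleep function is a parameter only so the loop can be exercised without real waiting; boto3 also ships a built-in "step_complete" waiter if you'd rather not write the loop yourself.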

Reading S3 data from Google's dataproc

I'm running a pyspark application through Google's dataproc on a cluster I created. In one stage, the application needs to access a directory in an Amazon S3 bucket. At that stage, I get the error:
AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).
I logged onto the headnode of the cluster and set the /etc/boto.cfg with my AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY information, but that didn't solve the access issue.
(1) Any other suggestions for how to access AWS S3 from a dataproc cluster?
(2) Also, what is the name of the user that dataproc uses to access the cluster? If I knew that, I could set the ~/.aws directory on the cluster for that user.
Thanks.
Since you're using the Hadoop/Spark interfaces (like sc.textFile), everything should indeed be done through the fs.s3.* or fs.s3n.* or fs.s3a.* keys rather than trying to wire through any ~/.aws or /etc/boto.cfg settings. There are a few ways you can plumb those settings through to your Dataproc cluster:
At cluster creation time:
gcloud dataproc clusters create --properties \
core:fs.s3.awsAccessKeyId=<s3AccessKey>,core:fs.s3.awsSecretAccessKey=<s3SecretKey> \
--num-workers ...
The core prefix here indicates you want the settings to be placed in the core-site.xml file, as explained in the Cluster Properties documentation.
Alternatively, at job-submission time, if you're using Dataproc's APIs:
gcloud dataproc jobs submit pyspark --cluster <your-cluster> \
--properties spark.hadoop.fs.s3.awsAccessKeyId=<s3AccessKey>,spark.hadoop.fs.s3.awsSecretAccessKey=<s3SecretKey> \
...
In this case, we're passing the properties through as Spark properties, and Spark provides a handy mechanism to define "hadoop" conf properties as a subset of Spark conf, simply using the spark.hadoop.* prefix. If you're submitting at the command line over SSH, this is equivalent to:
spark-submit --conf spark.hadoop.fs.s3.awsAccessKeyId=<s3AccessKey> \
--conf spark.hadoop.fs.s3.awsSecretAccessKey=<s3SecretKey>
Finally, if you want to set it up at cluster creation time but prefer not to have your access keys explicitly set in your Dataproc metadata, you might opt to use an initialization action instead. There's a handy tool called bdconfig that should be present on the path with which you can modify XML settings easily:
#!/bin/bash
# Create this shell script, name it something like init-aws.sh
bdconfig set_property \
--configuration_file /etc/hadoop/conf/core-site.xml \
--name 'fs.s3.awsAccessKeyId' \
--value '<s3AccessKey>' \
--clobber
bdconfig set_property \
--configuration_file /etc/hadoop/conf/core-site.xml \
--name 'fs.s3.awsSecretAccessKey' \
--value '<s3SecretKey>' \
--clobber
Upload that to a GCS bucket somewhere, and use it at cluster creation time:
gsutil cp init-aws.sh gs://<your-bucket>/init-aws.sh
gcloud dataproc clusters create --initialization-actions \
gs://<your-bucket>/init-aws.sh
While Dataproc metadata is indeed encrypted at rest and heavily secured just like any other user data, using the init action instead helps prevent inadvertently showing your access key/secret for example to someone standing behind your screen when viewing your Dataproc cluster properties.
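The two prefix conventions used above (core: at cluster creation, spark.hadoop. at job submission) can be summed up in a few lines of Python. This is just an illustrative helper for building the --properties value; the function name `prefixed_properties` is made up, and the key values are the same placeholders as in the commands above.

```python
def prefixed_properties(prefix, props):
    """Join props into a comma-separated --properties value, prefixing each key."""
    return ",".join(f"{prefix}{key}={value}" for key, value in props.items())


s3_keys = {
    "fs.s3.awsAccessKeyId": "<s3AccessKey>",
    "fs.s3.awsSecretAccessKey": "<s3SecretKey>",
}

# For `gcloud dataproc clusters create --properties ...` (lands in core-site.xml)
print(prefixed_properties("core:", s3_keys))

# For `gcloud dataproc jobs submit pyspark --properties ...` (Spark conf)
print(prefixed_properties("spark.hadoop.", s3_keys))
```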
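You can also try setting the AWS config when you initialize the SparkContext. In PySpark, the Hadoop configuration is reached through the underlying Java context:
from pyspark import SparkConf, SparkContext
conf = SparkConf()  # replace with your own SparkConf
sc = SparkContext(conf=conf)
# Set the S3 credentials on the Hadoop configuration used by this context
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.awsAccessKeyId", "<s3AccessKey>")
hadoop_conf.set("fs.s3n.awsSecretAccessKey", "<s3SecretKey>")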