How to change YARN scheduler configuration on AWS EMR? - amazon-web-services

Unlike Hortonworks or Cloudera, AWS EMR does not seem to provide any GUI for changing the XML configurations of the various Hadoop ecosystem frameworks.
Logging into my EMR namenode and doing a quick
find / -iname yarn-site.xml
I found yarn-site.xml at /etc/hadoop/conf.empty/yarn-site.xml and capacity-scheduler.xml at /etc/hadoop/conf.empty/capacity-scheduler.xml.
But note that these are under conf.empty, so I suspect they might not be the actual locations of the yarn-site and capacity-scheduler XMLs.
I understand that I can change these configurations when creating a cluster, but what I need to know is how to change them without tearing down the cluster.
I just want to play around with scheduling properties and try out different schedulers to identify what might work well with my Spark applications.
Thanks in advance!

Well, yarn-site.xml and capacity-scheduler.xml are indeed in the correct location (/etc/hadoop/conf.empty/). On a running cluster, editing them on the master node and restarting the YARN ResourceManager daemon will change the scheduler.
When spinning up a new cluster, you can use the EMR Configurations API to change the appropriate values. http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
For example: specify the appropriate values in the capacity-scheduler and yarn-site classifications of your Configuration, and EMR will change those values in the corresponding XML files.
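As a rough sketch of what such a Configuration could look like, the snippet below builds the two classifications as a Python list and prints them as JSON. The property names are standard YARN settings, but the specific values chosen here are only illustrative assumptions, not recommendations from the original answer.

```python
import json


def build_scheduler_configurations(scheduler_class, capacity_properties):
    """Return an EMR-style Configurations list for yarn-site and capacity-scheduler."""
    return [
        {
            "Classification": "yarn-site",
            "Properties": {
                # Selects which scheduler the ResourceManager loads.
                "yarn.resourcemanager.scheduler.class": scheduler_class,
            },
        },
        {
            "Classification": "capacity-scheduler",
            "Properties": dict(capacity_properties),
        },
    ]


# Illustrative values only (assumptions for this sketch).
configurations = build_scheduler_configurations(
    "org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler",
    {"yarn.scheduler.capacity.maximum-am-resource-percent": "0.5"},
)
configurations_json = json.dumps(configurations, indent=2)
print(configurations_json)
```

The printed JSON can be saved to a file and passed to aws emr create-cluster through its --configurations option, or pasted into the software settings field in the console.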
Edit: Sep 4, 2019 :
With Amazon EMR version 5.21.0 and later, you can override cluster configurations and specify additional configuration classifications for each instance group in a running cluster. You do this by using the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the AWS SDK.
Please see
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps-running-cluster.html
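For the SDK route, a minimal sketch with boto3 might look like the following. The cluster ID, instance group ID, and region are hypothetical placeholders, and the actual call is kept in a separate function so the request can be inspected before anything is sent. Treat it as an outline under those assumptions rather than a tested procedure.

```python
def build_reconfiguration_request(cluster_id, instance_group_id, configurations):
    """Build the keyword arguments for the EMR ModifyInstanceGroups call."""
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {
                "InstanceGroupId": instance_group_id,
                "Configurations": configurations,
            }
        ],
    }


def apply_reconfiguration(request, region_name):
    """Send the request to EMR. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so building the request needs no AWS setup

    client = boto3.client("emr", region_name=region_name)
    return client.modify_instance_groups(**request)


# "j-EXAMPLE" and "ig-EXAMPLE" are hypothetical IDs for illustration.
request = build_reconfiguration_request(
    "j-EXAMPLE",
    "ig-EXAMPLE",
    [
        {
            "Classification": "capacity-scheduler",
            "Properties": {"yarn.scheduler.capacity.maximum-am-resource-percent": "0.5"},
        }
    ],
)
print(request)
# apply_reconfiguration(request, "us-east-1")  # uncomment to actually send it
```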

Related

How to clone an AWS EMR cluster in command line?

I have a recurring task where I need to clone an existing EMR cluster (except with a different name). I have been doing this in the AWS Console (basically, finding the EMR cluster in the console, clicking "Clone", changing the name, then "Create cluster"). Is there a way to do this from the command line so that I can automate it? I have checked aws emr create-cluster help but nothing seems relevant. Thanks!
I think this is what you are looking for:
Assuming that you want the cluster to be a clone of the starting state of the original cluster, just create the first EMR cluster from a CloudFormation template and then create new clusters from the same template as needed. Here's an example template.
Cloning a Cluster Using the Console
You can use the Amazon EMR console to clone a cluster, which makes a copy of the configuration of the original cluster to use as the basis for a new cluster.
To clone a cluster using the console
From the Cluster List page, click a cluster to clone.
At the top of the Cluster Details page, click Clone.
In the dialog box, choose Yes to include the steps from the original cluster in the cloned cluster. Choose No to clone the original cluster's configuration without including any of the steps.
Note
For clusters created using AMI 3.1.1 and later (Hadoop 2.x) or AMI 2.4.8 and later (Hadoop 1.x), if you clone a cluster and include steps, all system steps (such as configuring Hive) are cloned along with user-submitted steps, up to 1,000 total. Any older steps that no longer appear in the console's step history cannot be cloned. For earlier AMIs, only 256 steps can be cloned (including system steps). For more information, see Submit Work to a Cluster.
The Create Cluster page appears with a copy of the original cluster's configuration. Review the configuration, make any necessary changes, and then click Create Cluster.

Where to put application property file for spark application running on AWS EMR

I am submitting a Spark application jar to EMR, and it uses a property file. I could put the file in S3 and, while creating the EMR cluster, download it and copy it to some location on the EMR box. If this is the best way, how can I do this at bootstrap time while creating the EMR cluster itself?
Check the following snapshot.
In Edit software settings you can add your own configuration or a JSON file (stored in an S3 location), and through this setting you can pass configuration parameters to the EMR cluster at creation time. For more details, please check the following links:
Amazon EMR Cluster Configurations
Configuring Applications
AWS CLI
Hope this helps.

Disable multipart upload on EMR

The goal is to disable the multipart upload on Amazon EMR.
The guide says enter classification=core-site,properties=[fs.s3.multipart.uploads.enabled=false] in Edit Software Settings when creating the EMR cluster.
My questions are:
Can we modify the configurations for existing EMR cluster? If so, how to do it?
Can we achieve the same goal by putting sparkSession.sparkContext.hadoopConfiguration.set("fs.s3.multipart.uploads.enabled","false") in the jar to be executed on EMR?
Unfortunately, on EMR releases earlier than 5.21.0 you cannot modify configurations on a running cluster (5.21.0 and later allow overriding configurations for each instance group in a running cluster). If it's possible for you to start a new one, you could use the AWS EMR Console to clone your current cluster's configuration and then modify the configuration before launching it. (Note: Only the configuration is cloned, not any of the data that may be stored in HDFS or on the cluster instances' local disks.)
However, I believe that what you asked about in your second question will work as intended. Have you tried this and found it not to work?
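For reference, the shorthand quoted in the question corresponds to a JSON configuration object. The helper below is a hypothetical converter written for this sketch (it is not part of any AWS tool) and it assumes property values contain no commas or brackets.

```python
import json
import re

# Matches the simple shorthand form: classification=NAME,properties=[k=v,k=v]
SHORTHAND = re.compile(r"^classification=(?P<name>[^,]+),properties=\[(?P<props>.*)\]$")


def shorthand_to_configuration(text):
    """Convert the simple shorthand into an EMR-style configuration dict."""
    match = SHORTHAND.match(text.strip())
    if match is None:
        raise ValueError("unrecognized shorthand: " + text)
    properties = {}
    # Skip empty pieces so that "properties=[]" yields no properties.
    for pair in filter(None, match.group("props").split(",")):
        key, _, value = pair.partition("=")
        properties[key.strip()] = value.strip()
    return {"Classification": match.group("name"), "Properties": properties}


core_site = shorthand_to_configuration(
    "classification=core-site,properties=[fs.s3.multipart.uploads.enabled=false]"
)
print(json.dumps([core_site], indent=2))
```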

How to run Python Spark code on Amazon Aws?

I have written Python code for Spark and I want to run it on Amazon's Elastic MapReduce.
My code works great on my local machine, but I am slightly confused about how to run it on Amazon's AWS.
More specifically, how should I transfer my Python code over to the master node? Do I need to copy my Python code to my S3 bucket and execute it from there? Or should I ssh into the master and scp my Python code to the Spark folder on the master?
For now, I tried running the code locally from my terminal and connecting to the cluster address (I did this by reading the output of Spark's --help flag, so I might be missing a few steps here):
./bin/spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.1 \
--master spark://hadoop@ec2-public-dns-of-my-cluster.compute-1.amazonaws.com \
mypythoncode.py
I tried it with and without my permissions file i.e.
-i permissionsfile.pem
However, it fails and the stack trace shows something on the lines of
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
......
......
Is my approach correct, so that I only need to resolve the access issues to get going, or am I heading in the wrong direction?
What is the right way of doing it?
I searched a lot on YouTube but couldn't find any tutorials on running Spark on Amazon's EMR.
If it helps, the dataset I am working on is part of Amazon's public datasets.
Go to EMR and create a new cluster [recommendation: start with 1 node only, just for testing purposes].
Click the checkbox to install Spark; you can uncheck the other boxes if you don't need those additional programs.
Configure the cluster further by choosing a VPC and a security key (SSH key, a.k.a. pem key).
Wait for it to boot up. Once your cluster says "waiting", you're free to proceed.
[Spark submission via the GUI] In the GUI, you can add a Step, select Spark job, upload your Spark file to S3, and then choose the path to that newly uploaded S3 file. Once it runs it will either succeed or fail. If it fails, wait a moment, and then click "view logs" on that Step's line in the list of steps. Keep tweaking your script until you've got it working.
[Submission via the command line] SSH into the driver node following the ssh instructions at the top of the page. Once inside, use a command-line text editor to create a new file, and paste the contents of your script in. Then run spark-submit yourNewFile.py. If it fails, you'll see the error output straight in the console. Tweak your script and re-run. Do that until you've got it working as expected.
Note: running jobs from your local machine against a remote machine is troublesome because you may actually be making your local instance of Spark responsible for some expensive computations and data transfer over the network. That's why you want to submit AWS EMR jobs from within EMR.
There are typically two ways to run a job on an Amazon EMR cluster (whether for Spark or other job types):
Log in to the master node and run Spark jobs interactively. See: Access the Spark Shell
Submit jobs to the EMR cluster. See: Adding a Spark Step
If you have Apache Zeppelin installed on your EMR cluster, you can use a web browser to interact with Spark.
The error you are experiencing is saying that files were accessed via the s3n: protocol, which requires AWS credentials to be provided. If, instead, the files were accessed via s3:, I suspect that the credentials would be sourced from the IAM Role that is automatically assigned to nodes in the cluster and this error would be resolved.

Setting up Spark on existing EC2 cluster

I have to access some big files in buckets in Amazon S3 and do processing on them. For this I was planning to use Apache Spark. I have 2 EC2 instances for this learning project. They are only used for small cron jobs, so could I use them to install and run Spark? If so, how do I install Spark on existing EC2 boxes so that I can make one the master and one the slave?
If it helps, I installed Spark in standalone mode on both instances, setting one as master and the other as slave. The detailed instructions I followed are at
https://spark.apache.org/docs/1.2.0/spark-standalone.html#installing-spark-standalone-to-a-cluster
See the tutorial on Apache Spark Cluster on EC2 here http://www.supergloo.com/fieldnotes/apache-spark-cluster-amazon-ec2-tutorial/
Yes, you can easily create a master/slave setup with two AWS instances. Set SPARK_MASTER_IP=instance_privateIP_1 in spark-env.sh on both instances and put instance 2's private IP in the slaves file in the conf folder. Keep these files, and other settings such as memory and cores, identical on both machines, and make sure Spark is installed at the same location on both. You can then start the cluster from the master.
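To make the two files concrete, here is a small sketch that renders their contents for a given pair of private IPs. The IP addresses are hypothetical, and the script only prints what would go into conf/spark-env.sh and conf/slaves; copying the text to both machines is left to you.

```python
def standalone_config(master_private_ip, worker_private_ips):
    """Return the text for conf/spark-env.sh and conf/slaves as a dict."""
    spark_env = "export SPARK_MASTER_IP={}\n".format(master_private_ip)
    # The slaves file lists one worker address per line.
    slaves = "".join("{}\n".format(ip) for ip in worker_private_ips)
    return {"conf/spark-env.sh": spark_env, "conf/slaves": slaves}


# Hypothetical private IPs for the two instances.
files = standalone_config("10.0.0.11", ["10.0.0.12"])
for path, content in files.items():
    print("# " + path)
    print(content)
```

The same output should be placed on both machines so the configuration stays identical, as the answer recommends.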