using spark with aws cluster - amazon-web-services

I set up a cluster successfully by following the instructions here. Can I invoke Spark via an API with this type of cluster, and where can I find the Spark endpoint details? If the aforementioned tutorial is a dead end, could anyone point me in the right direction?
My ultimate POC aim is to add two columns of a flat file (e.g. a CSV) in an S3 bucket and compare the resulting values with a third column via Spark (this is not homework (-:), ideally using Mobius, as I am a [former] .NET dev.

This reference should provide you the information you need. Here is a snippet:
"Go into the ec2 directory in the release of Apache Spark you downloaded.
Run ./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>, where <keypair> is the name of your EC2 key pair (that you gave it when you created it), <key-file> is the private key file for your key pair, <num-slaves> is the number of slave nodes to launch (try 1 at first), and <cluster-name> is the name to give to your cluster.
For example:
export AWS_SECRET_ACCESS_KEY=AaBbCcDdEeFGgHhIiJjKkLlMmNnOoPpQqRrSsTtU
export AWS_ACCESS_KEY_ID=ABCDEFG1234567890123
./spark-ec2 --key-pair=awskey --identity-file=awskey.pem --region=us-west-1 --zone=us-west-1a launch my-spark-cluster
After everything launches, check that the cluster scheduler is up and sees all the slaves by going to its web UI, which will be printed at the end of the script (typically http://master-hostname:8080)."

Related

How to clone an AWS EMR cluster in command line?

I have a recurring task where I need to clone an existing EMR cluster (except with a different name). I have been doing this in the AWS Console (basically, finding the EMR cluster in the console, click "Clone", change the name, then "Create cluster"). Is there a way to do this in command line so that I can automate it? I have checked aws emr create-cluster help but nothing seems relevant. Thanks!
I think this is what you are looking for:
Assuming that you want the cluster to be a clone of the starting state of the original cluster, just create the first EMR cluster from a CloudFormation template and then create new clusters from the same template as needed. Here's an example template.
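To make the template idea concrete, here is a minimal sketch of what such a CloudFormation template could look like, generated from Python so the JSON stays valid. It is not the template linked above. The release label, instance types, instance counts and the logical name "Cluster" are placeholder assumptions, and the two role names are the EMR default roles, which your account may or may not use.

```python
import json


def emr_template():
    """Return a minimal CloudFormation template (as a dict) for one EMR cluster.

    Every concrete value below is a placeholder, not taken from the question.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            # The only thing that differs between clones is the name.
            "ClusterName": {"Type": "String"},
        },
        "Resources": {
            "Cluster": {
                "Type": "AWS::EMR::Cluster",
                "Properties": {
                    "Name": {"Ref": "ClusterName"},
                    "ReleaseLabel": "emr-5.21.0",        # placeholder
                    "Applications": [{"Name": "Spark"}],  # placeholder
                    "JobFlowRole": "EMR_EC2_DefaultRole",
                    "ServiceRole": "EMR_DefaultRole",
                    "Instances": {
                        "MasterInstanceGroup": {
                            "InstanceCount": 1,
                            "InstanceType": "m4.large",   # placeholder
                        },
                        "CoreInstanceGroup": {
                            "InstanceCount": 2,
                            "InstanceType": "m4.large",   # placeholder
                        },
                    },
                },
            }
        },
    }


# Write this output to a file (e.g. emr.json) and reuse it for every clone.
print(json.dumps(emr_template(), indent=2))
```

Each new "clone" is then one more stack created from the same file with a different ClusterName parameter, for example with aws cloudformation create-stack --stack-name my-clone --template-body file://emr.json --parameters ParameterKey=ClusterName,ParameterValue=my-clone.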
Cloning a Cluster Using the Console
You can use the Amazon EMR console to clone a cluster, which makes a copy of the configuration of the original cluster to use as the basis for a new cluster.
To clone a cluster using the console
From the Cluster List page, click a cluster to clone.
At the top of the Cluster Details page, click Clone.
In the dialog box, choose Yes to include the steps from the original cluster in the cloned cluster. Choose No to clone the original cluster's configuration without including any of the steps.
Note
For clusters created using AMI 3.1.1 and later (Hadoop 2.x) or AMI 2.4.8 and later (Hadoop 1.x), if you clone a cluster and include steps, all system steps (such as configuring Hive) are cloned along with user-submitted steps, up to 1,000 total. Any older steps that no longer appear in the console's step history cannot be cloned. For earlier AMIs, only 256 steps can be cloned (including system steps). For more information, see Submit Work to a Cluster.
The Create Cluster page appears with a copy of the original cluster's
configuration. Review the configuration, make any necessary changes,
and then click Create Cluster.

Self describe regions with ECS/EC2 Instance

I want to securely fetch configuration files from S3 using a secure VPC from my Docker container. But I want to determine inside the application which configuration file to fetch and use based on the region I am on. Is there a good/best practice to go on about describing the current container's region?
I understand that you can use the AWS SDK/CLI to describe the ECS instances, but that doesn't tell me which one the container is specifically deployed on.
Use the metadata server to query the availability-zone from which you can get the region.
$ curl 169.254.169.254/latest/meta-data/placement/availability-zone/
us-east-1a
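One example, if you are using Python:
import os
# Ask the instance metadata service for the availability zone, e.g. "us-east-1a".
az = os.popen("curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone/").read().strip()
# Drop the trailing zone letter to get the region.
print(az[:-1])
# prints: us-east-1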
Within EC2, you can retrieve the instance metadata using a simple curl command to a local (internal) web API. Region and AZ are some of the data points you can get:
http://169.254.169.254/latest/meta-data/services/domain
http://169.254.169.254/latest/meta-data/placement/availability-zone
See this page for full details about instance metadata.
Within ECS, I'd be interested to see if these might still work -- my hunch is they would, as the container should query the host machine's API for the answer, and the ECS host is most certainly an EC2 instance.
Let us know if that works?
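If you would rather not shell out to curl, the same lookup can be sketched with the standard library only. This assumes the container can reach the host's metadata endpoint and that the instance still answers plain unauthenticated requests (instances that enforce IMDSv2 require a session token first). The function names are made up for illustration.

```python
import string
import urllib.request

METADATA_URL = (
    "http://169.254.169.254/latest/meta-data/placement/availability-zone"
)


def region_from_az(az):
    """Turn an availability zone such as 'us-east-1a' into 'us-east-1'.

    Assumes the usual '<region><zone letter>' naming.
    """
    return az.strip().rstrip(string.ascii_lowercase)


def current_region(timeout=2):
    """Ask the metadata service for the AZ and derive the region from it.

    Only works from inside EC2 (or a container that can reach the host's
    metadata endpoint); elsewhere the request will time out.
    """
    with urllib.request.urlopen(METADATA_URL, timeout=timeout) as resp:
        return region_from_az(resp.read().decode("utf-8"))


# Usage inside the container (not executed here):
#   region = current_region()
```

The application can then pick its S3 configuration file based on the returned region string.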

How to change yarn scheduler configuration on aws EMR?

Unlike HortonWorks or Cloudera, AWS EMR does not seem to give any GUI to change xml configurations of various hadoop ecosystem frameworks.
Logging into my EMR namenode and doing a quick
find / -iname yarn-site.xml
I was able to find it to be located at /etc/hadoop/conf.empty/yarn-site.xml and capacity-scheduler to be located at /etc/hadoop/conf.empty/capacity-scheduler.xml.
But note how these are under conf.empty and I suspect these might not be the actual locations for yarn-site and capacity-scheduler xmls.
I understand that I can change these configurations while making a cluster but what I need to know is how to be able to change them without tearing apart the cluster.
I just want to play around with scheduling properties and try out different schedulers to identify what might work well with my Spark applications.
Thanks in advance!
Well, yarn-site.xml and capacity-scheduler.xml are indeed in the correct location (/etc/hadoop/conf.empty/). On a running cluster, editing them on the master node and restarting the YARN ResourceManager daemon will change the scheduler.
When spinning up a new cluster, you can use the EMR Configurations API to change the appropriate values: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html
For example, specify the appropriate values in the capacity-scheduler and yarn-site classifications of your configuration, and EMR will change those values in the corresponding XML files.
Edit: Sep 4, 2019 :
With Amazon EMR version 5.21.0 and later, you can override cluster configurations and specify additional configuration classifications for each instance group in a running cluster. You do this by using the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the AWS SDK.
Please see
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps-running-cluster.html
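As a rough illustration of the classifications mentioned above, this sketch builds a configurations list with a yarn-site and a capacity-scheduler entry and prints it as JSON. The helper name and the property value chosen are only examples; substitute the properties you actually want to experiment with.

```python
import json

# Fully qualified class name of YARN's capacity scheduler.
CAPACITY_SCHEDULER = (
    "org.apache.hadoop.yarn.server.resourcemanager."
    "scheduler.capacity.CapacityScheduler"
)


def scheduler_configurations(scheduler_class, capacity_props=None):
    """Build an EMR-style configurations list (hypothetical helper).

    yarn-site selects the scheduler; capacity-scheduler carries its tuning.
    """
    configs = [
        {
            "Classification": "yarn-site",
            "Properties": {
                "yarn.resourcemanager.scheduler.class": scheduler_class,
            },
        }
    ]
    if capacity_props:
        configs.append(
            {
                "Classification": "capacity-scheduler",
                "Properties": dict(capacity_props),
            }
        )
    return configs


# Example value only; property values are passed as strings.
example = scheduler_configurations(
    CAPACITY_SCHEDULER,
    {"yarn.scheduler.capacity.maximum-am-resource-percent": "0.5"},
)
print(json.dumps(example, indent=2))
```

Saved to a file, the output can be passed when creating a cluster, e.g. with the --configurations file://configurations.json option of aws emr create-cluster.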

How to run Python Spark code on Amazon Aws?

I have written Spark code in Python and I want to run it on Amazon's Elastic MapReduce (EMR).
My code works great on my local machine, but I am slightly confused over how to run it on Amazon's AWS?
More specifically, how should I transfer my python code over to the Master node? Do I need to copy my Python code to my s3 bucket and execute it from there? Or, should I ssh into Master and scp my python code to the spark folder in Master?
For now, I tried running the code locally from my terminal and connecting to the cluster address (I did this by reading the output of Spark's --help flag, so I might be missing a few steps here):
./bin/spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.1 \
--master spark://hadoop#ec2-public-dns-of-my-cluster.compute-1.amazonaws.com \
mypythoncode.py
I tried it with and without my permissions file i.e.
-i permissionsfile.pem
However, it fails and the stack trace shows something on the lines of
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
......
......
Is my approach correct and I need to resolve the Access issues to get going or am I heading in a wrong direction?
What is the right way of doing it?
I searched a lot on youtube but couldn't find any tutorials on running Spark on Amazon's EMR.
If it helps, the dataset I am working on it is part of Amazon's public dataset.
go to EMR, create new cluster... [recommendation: start with 1 node only, just for testing purposes].
Click the checkbox to install Spark, you can uncheck the other boxes if you don't need those additional programs.
configure the cluster further by choosing a VPC and a security key (ssh key, a.k.a pem key)
wait for it to boot up. Once your cluster says "waiting", you're free to proceed.
[spark submission via the GUI] Upload your Spark file to S3; then, in the GUI, add a Step, select Spark job, and choose the path to that newly uploaded S3 file. Once it runs it will either succeed or fail. If it fails, wait a moment, and then click "view logs" on that Step's line in the list of steps. Keep tweaking your script until you've got it working.
[submission via the command line] SSH into the driver node following the ssh instructions at the top of the page. Once inside, use a command-line text editor to create a new file, and paste the contents of your script in. Then spark-submit yourNewFile.py. If it fails, you'll see the error output straight to the console. Tweak your script, and re-run. Do that until you've got it working as expected.
Note: running jobs from your local machine against a remote cluster is troublesome because you may actually be causing your local instance of Spark to be responsible for some expensive computations and data transfer over the network. That's why you want to submit AWS EMR jobs from within EMR.
There are typically two ways to run a job on an Amazon EMR cluster (whether for Spark or other job types):
Log in to the master node and run Spark jobs interactively. See: Access the Spark Shell
Submit jobs to the EMR cluster. See: Adding a Spark Step
If you have Apache Zeppelin installed on your EMR cluster, you can use a web browser to interact with Spark.
The error you are experiencing says that the files were accessed via the s3n: protocol, which requires AWS credentials to be provided. If, instead, the files were accessed via s3:, I suspect that the credentials would be sourced from the IAM Role that is automatically assigned to nodes in the cluster, and this error would be resolved.
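For the "add a Spark step" route, the step can also be submitted programmatically instead of through the console. This is a sketch, assuming the script has already been uploaded to S3, the cluster runs a release that provides command-runner.jar, and boto3 is installed and has credentials. The function names, the bucket path and the cluster id shown in the usage comment are placeholders.

```python
def spark_step(name, script_s3_path, extra_args=()):
    """Describe one EMR step that runs spark-submit on a script stored in S3."""
    return {
        "Name": name,
        # Keep the cluster running if the step fails, so logs can be inspected.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_s3_path]
            + list(extra_args),
        },
    }


def submit(cluster_id, step):
    """Send the step to a running cluster and return its step id."""
    import boto3  # third-party; imported lazily so the sketch loads without it

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


# Usage (placeholders, not executed here):
#   step = spark_step("my job", "s3://my-bucket/mypythoncode.py")
#   submit("j-XXXXXXXXXXXXX", step)
```

Because the job is launched by the cluster itself, the driver does not run on your local machine, which avoids the problem described in the note above.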

Deploying multiple Deis clusters

I am looking to create a number of Deis clusters running in parallel on AWS and haven't been able to find any good documentation on how to do so. From what I understand I'd have to do the following:
When provisioning the cluster:
Create a new discovery URL
Give the stack a different name other than the standard "deis" when using the ./provision-aws-cluster.sh script
Create different Deis profiles in $HOME/.deis/client.json that map to each cluster
And when utilizing the deisctl and deis command line interfaces, I need to specify the DEISCTL_TUNNEL and the DEIS_PROFILE each time, respectively.
Am I missing anything? Will this impact my current Deis cluster if I install using the changes listed above?
That is correct, I don't believe you are missing anything. You should save the cloud-config for each cluster (in contrib/coreos), that will have the discovery url in it and possibly other customizations depending on how your clusters will be configured. If the clusters are going to be different on the AWS side, make sure you save the cloudformation.json file for each as well.
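To avoid retyping DEISCTL_TUNNEL and DEIS_PROFILE for every command, one option is a small wrapper that sets both per cluster. This is only a sketch: the cluster names, tunnel hosts and profile names below are made up, and it assumes deisctl and deis are on your PATH.

```python
import os
import subprocess

# Hypothetical clusters: tunnel host for deisctl, profile name for deis.
CLUSTERS = {
    "staging": {"tunnel": "staging.example.com", "profile": "staging"},
    "production": {"tunnel": "prod.example.com", "profile": "production"},
}


def cluster_env(name, base_env=None):
    """Return a copy of the environment with the variables for one cluster."""
    cluster = CLUSTERS[name]
    env = dict(os.environ if base_env is None else base_env)
    env["DEISCTL_TUNNEL"] = cluster["tunnel"]
    env["DEIS_PROFILE"] = cluster["profile"]
    return env


def run(name, command):
    """Run a deisctl or deis command against the named cluster."""
    return subprocess.run(command, env=cluster_env(name), check=True)


# Usage (not executed here):
#   run("staging", ["deisctl", "list"])
```

Since each call gets its own copy of the environment, commands aimed at one cluster cannot accidentally pick up the tunnel or profile of another.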