spark-submit from outside AWS EMR cluster - amazon-web-services

I have an AWS EMR cluster running Spark, and I'd like to submit a PySpark job to it from my laptop (--master yarn) to run in cluster mode.
I know that I need to set up some config on the laptop, but I'd like to know what the bare minimum is. Do I just need some of the config files from the master node of the cluster? If so, which? Or do I need to install Hadoop or YARN on my local machine?
I've done a fair bit of searching for an answer, but I haven't yet been able to tell whether what I was reading referred to launching a job from the master node of the cluster or from some arbitrary laptop...

If you want to run the spark-submit job solely on your AWS EMR cluster, you do not need to install anything locally. You only need the EC2 key pair you specified in the Security Options when you created the cluster.
I personally scp over any relevant scripts and/or jars, ssh into the master node of the cluster, and then run spark-submit.
You can specify most of the relevant Spark job configuration via spark-submit itself. AWS documents in more detail how to configure spark-submit jobs.
For example:
>> scp -i ~/PATH/TO/${SSH_KEY} /PATH/TO/PYSPARK_SCRIPT.py hadoop@${PUBLIC_MASTER_DNS}:
>> ssh -i ~/PATH/TO/${SSH_KEY} hadoop@${PUBLIC_MASTER_DNS}
>> spark-submit --conf spark.OPTION.OPTION=VALUE PYSPARK_SCRIPT.py
However, if you already pass a particular configuration when creating the cluster itself, you do not need to re-specify those same configuration options via spark-submit.
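For example, here is a minimal sketch (the configuration values are placeholders, not a recommendation) of setting Spark defaults at cluster creation time with --configurations, so they don't have to be repeated on every spark-submit:
aws emr create-cluster --name "MyCluster" --release-label emr-5.12.0 \
--applications Name=Spark --use-default-roles --ec2-attributes KeyName=myKey \
--instance-type m3.xlarge --instance-count 3 \
--configurations '[{"Classification":"spark-defaults","Properties":{"spark.executor.memory":"4g","spark.executor.cores":"2"}}]'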

You can set up the AWS CLI on your local machine, put your deployment on S3, and then add an EMR step to run on the EMR cluster. Something like this:
aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=SparkWordCountApp,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/],ActionOnFailure=CONTINUE
Source: https://aws.amazon.com/de/blogs/big-data/submitting-user-applications-with-spark-submit/
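Because the step above sets spark.yarn.submit.waitAppCompletion=false, add-steps returns immediately; you can then poll the step yourself. A sketch (the step ID is a placeholder returned by add-steps):
aws emr describe-step --cluster-id j-xxxxx --step-id s-XXXXXXXXXXXXX --query 'Step.Status.State' --output text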

How to add an EMR Spark Step?

According to the docs:
For Step type, choose Spark application.
But in Amazon EMR -> Clusters -> mycluster -> Steps -> Add step -> Step type, the only options are:
There are two ways to add EMR Spark steps:
- Using command-runner.jar (Custom JAR application)
spark-submit --class org.apache.spark.examples.SparkPi /usr/lib/spark/examples/jars/spark-examples.jar 10
Using the AWS CLI to do the same:
aws emr add-steps --cluster-id j-xxxxxxxx --steps Name="add emr step to run spark",Jar="command-runner.jar",Args=[spark-submit,--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/examples/jars/spark-examples.jar,10]
- Spark Application
I don't have a Spark Application option because I created a Core Hadoop cluster. When I created the cluster, I should have chosen Spark under Software configuration; then I would have had the Spark application option under Step type.
You can use command-runner.jar for your use case. For the step type, let it be Custom JAR from the options that you have.
You can read more about command-runner.jar in the command-runner usage documentation.
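For a PySpark script instead of the bundled example JAR, roughly the same pattern applies (the cluster ID, bucket, and script names are placeholders):
aws emr add-steps --cluster-id j-xxxxxxxx --steps Name="run pyspark script",Jar="command-runner.jar",Args=[spark-submit,--deploy-mode,cluster,s3://mybucket/wordcount.py],ActionOnFailure=CONTINUE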

Import manually created K8s cluster into KOps

For some time now I've been visiting every web page that mentions "KOps import", but I have not found a way to import my manually created K8s cluster. "Manually created" means the infrastructure was deployed on AWS using Terraform and Kubernetes was installed via Terraform provisioner shell scripts. Managing the environment manually is a pain, so I'd like to move it under KOps. For that I have done the following so far:
Installed aws cli, kubectl and kops on my local machine.
Created a KOps user with the policies AmazonEC2FullAccess, AmazonRoute53FullAccess, AmazonS3FullAccess, IAMFullAccess, AmazonVPCFullAccess and generated access and secret keys.
Configured credentials using aws configure.
Created an S3 bucket to store state.
Set env variables like Region and Cluster name.
Finally, ran the kops import command as below:
kops import cluster --region ${REGION} --name ${OLD_NAME}
But I encountered the error below:
Cluster.kops "jjm-prod-use1-kubernetes" not found
Verbose output:
$ kops import cluster --region ${REGION} --name ${OLD_NAME} -v 10
I0131 16:32:12.059651 25683 factory.go:68] state store s3://kops-state-store-jjm
I0131 16:32:13.133145 25683 s3context.go:194] found bucket in region "us-east-1"
I0131 16:32:13.133174 25683 s3fs.go:220] Reading file "s3://kops-state-store-jjm/jjm-prod-use1-kubernetes/config"
Which made me serious enough to post this question. Is there any possible way that a K8s cluster created by anything other than kube-up.sh can be brought under the control of KOps? Please advise.
Note: There's no way I can re-create (destroy and create) the clusters as they are running in production.
EDIT: I know this can be achieved only if the cluster was set up using kube-up.sh. But is there any other way?
That is only possible with a cluster bootstrapped via the kube-up.sh script, as officially stated in the KOps documentation pages. In fact, kube-up.sh has since been excluded from the list of supported Kubernetes installation tools for AWS. Although a cluster built by kube-up.sh provides a lot of customization settings that are specifically applicable to AWS, the initial script uses environment variables to define these settings. Therefore, I assume it would be quite hard to achieve in your case.

Run cron task on AWS EMR master node

How can I run a periodic job in the background on an EMR cluster?
I have script.sh with a cron job and application.py in S3, and I want to launch a cluster with this command:
aws emr create-cluster \
--name "Test cluster" \
--release-label emr-5.12.0 \
--applications Name=Hive Name=Pig Name=Ganglia Name=Spark \
--use-default-roles \
--ec2-attributes KeyName=myKey \
--instance-type m3.xlarge \
--instance-count 3 \
--steps Type=CUSTOM_JAR,Name=CustomJAR,ActionOnFailure=CONTINUE,Jar=s3://region.elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://mybucket/script-path/script.sh"]
Finally, I want the cron job from script.sh to execute application.py.
Now I don't understand how to install cron on the master node; also, the Python file needs some libraries, and they should be installed too.
You need to SSH into the master node, and then perform the crontab setup from there, not on your local machine:
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-connect-master-node-ssh.html
Connect to the Master Node Using SSH
Secure Shell (SSH) is a network protocol you can use to create a
secure connection to a remote computer. After you make a connection,
the terminal on your local computer behaves as if it is running on the
remote computer. Commands you issue locally run on the remote
computer, and the command output from the remote computer appears in
your terminal window.
When you use SSH with AWS, you are connecting to an EC2 instance,
which is a virtual server running in the cloud. When working with
Amazon EMR, the most common use of SSH is to connect to the EC2
instance that is acting as the master node of the cluster.
Using SSH to connect to the master node gives you the ability to
monitor and interact with the cluster. You can issue Linux commands on
the master node, run applications such as Hive and Pig interactively,
browse directories, read log files, and so on. You can also create a
tunnel in your SSH connection to view the web interfaces hosted on the
master node. For more information, see View Web Interfaces Hosted on
Amazon EMR Clusters.
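For reference, the connection itself typically looks like this (the key path and master public DNS are placeholders):
ssh -i ~/PATH/TO/${SSH_KEY} hadoop@${PUBLIC_MASTER_DNS}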
By default, cron is installed on the Linux system; you don't need to install it manually.
To schedule the Spark job with cron, follow the steps below:
Log in to the master node (SSH into the master).
Run the command:
crontab -e
Add the line below to the crontab and save it (:wq in vi):
*/15 * * * * /script-path/script.sh
Now cron will run the job every 15 minutes.
Please refer to this link to learn more about cron.
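Since application.py also needs extra Python libraries, script.sh itself could install them, fetch the application, and register the cron entry. A rough sketch, with placeholder package, bucket, and path names:
#!/bin/bash
# install the libraries application.py depends on (placeholder package name)
sudo pip install requests
# copy the application from S3 to the master node
aws s3 cp s3://mybucket/script-path/application.py /home/hadoop/application.py
# run application.py via spark-submit every 15 minutes (this replaces any existing crontab)
echo "*/15 * * * * /usr/bin/spark-submit /home/hadoop/application.py" | crontab -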
Hope this helps.
Thanks
Ravi

Can I run a job on EMR like on my local cluster

I have built a local cluster on my laptop (pseudo-distributed mode), where I run different MapReduce commands like:
hadoop-streaming -D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-files my_mapper.py,my_reducer.py \
-mapper my_mapper.py \
-reducer my_reducer.py \
-input /aws/input/input_warc.txt \
-output /aws/output
Now I have to run it on EMR. There are two options that can be used: one is the console and the second is the AWS CLI. I want to run exactly the commands above. For that, I think that if I SSH to the EMR master, I should be able to run this command. Is this the right way, or are there any drawbacks to this approach?
Yes, you may SSH to your cluster and run your jobs there, but you may also use the Step API (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-steps.html) to run arbitrary commands on the master instance, including, of course, distributed jobs like your example. You can add Steps to a cluster using the AWS CLI ("aws emr add-steps ...", or during cluster creation using "aws emr create-cluster ... --steps ..."), using the AWS SDKs (like the AWS Java SDK), or using the AWS EMR Console.
Some advantages of the Step API include that it captures the output of each step so that you can view it via the AWS CLI, SDK, or AWS Console, and you can also check the status of Steps to determine when they have completed.
One disadvantage of the Step API is that currently Steps all run sequentially, so you can't have multiple Steps running in parallel.
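As a sketch, the streaming job from the question could be wrapped in a step roughly like this (the cluster ID and bucket paths are placeholders; command-runner.jar runs the hadoop-streaming command on the master node):
aws emr add-steps --cluster-id j-xxxxxxxx --steps '[{
  "Type": "CUSTOM_JAR",
  "Name": "StreamingJob",
  "ActionOnFailure": "CONTINUE",
  "Jar": "command-runner.jar",
  "Args": ["hadoop-streaming",
           "-D", "mapred.output.compress=true",
           "-D", "mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
           "-files", "s3://mybucket/my_mapper.py,s3://mybucket/my_reducer.py",
           "-mapper", "my_mapper.py",
           "-reducer", "my_reducer.py",
           "-input", "s3://mybucket/input/input_warc.txt",
           "-output", "s3://mybucket/output/"]
}]'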

Enable debugging/logging on running EMR cluster

I forgot to add the following two options when creating my EMR cluster with the AWS CLI:
--log-uri s3://mybucket/logs/ --enable-debugging
Is there a way to add a log uri and enable debugging on a running cluster?
You cannot enable debugging on a running cluster.
Logging to S3 also needs to be enabled when you create the cluster.
By default, each cluster writes log files on the master node. These are written to the /mnt/var/log/ directory. You can access them by using SSH to connect to the master node.
For more details, please check here.
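For reference, if the cluster can be re-launched at some point, both options can be set at creation time. A sketch (names and values are placeholders):
aws emr create-cluster --name "Test cluster" --release-label emr-5.12.0 \
--applications Name=Spark --use-default-roles --ec2-attributes KeyName=myKey \
--instance-type m3.xlarge --instance-count 3 \
--log-uri s3://mybucket/logs/ --enable-debugging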