run spark on AWS EMR by passing credentials - amazon-web-services

I am new to EMR and tried to launch a Spark job as a step using something like: command-runner.jar spark-submit --deploy-mode cluster --class com.xx.xx.className s3n://mybuckets/spark-jobs.jar
However, the Spark job needs credentials as environment variables. My question is: what is the best way to pass these credentials as environment variables to the Spark job?
Thanks!

Have a look here: AWS EMR 4.0 - How can I add a custom JAR step to run shell commands, and here: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hadoop-script.html
Try running the step with arguments like this: /usr/bin/spark-submit --deploy-mode cluster --class com.xx.xx.className s3n://mybuckets/spark-jobs.jar

I came to your question while googling for a solution myself. Right now, as a temporary solution, I am passing the credentials as command-line parameters. In the future I am thinking of adding a custom bootstrap script that will fetch the credentials from a service and create the ~/.aws/credentials and config files.
I hope this helps, and if you have discovered any other option, please post it here.
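If the job really needs them as environment variables in cluster mode on YARN, one option is to set them through Spark's spark.yarn.appMasterEnv.* and spark.executorEnv.* properties in the step arguments. A minimal sketch, reusing the class and jar from the question; the key values are obviously placeholders:
/usr/bin/spark-submit --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.AWS_ACCESS_KEY_ID=AKIA... \
  --conf spark.yarn.appMasterEnv.AWS_SECRET_ACCESS_KEY=... \
  --conf spark.executorEnv.AWS_ACCESS_KEY_ID=AKIA... \
  --conf spark.executorEnv.AWS_SECRET_ACCESS_KEY=... \
  --class com.xx.xx.className s3n://mybuckets/spark-jobs.jar
Note that anything passed this way is visible in the step arguments, so for real secrets the bootstrap-script approach that writes ~/.aws/credentials is usually the safer choice.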

Related

How can I launch Trino on AWS EMR?

When I created the EMR cluster, I chose 'Trino' in the 'Application' step; I can confirm this.
When I connect to the master node using SSH and type 'presto --version', I get 'presto: command not found'.
I also tried 'presto-cli' as the EMR docs say, and still got 'presto-cli: command not found'.
Also, according to the Trino docs, I should go to the 'bin/launcher' directory and launch Trino. However, I do not know where that is on my cluster.
I am a noob. Could you please tell me how to run/launch Trino on AWS EMR?
If you're looking to start up the Trino CLI, you should use 'trino-cli'. (Recall that PrestoSQL was renamed to Trino a couple of years ago, which triggered a whole host of renaming both inside and out.)
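For example, on the master node (a minimal sketch; 8889 is the default coordinator port on EMR, and the catalog/schema names are placeholders for whatever you configured):
[hadoop@ip-w-x-y-z ~]$ trino-cli --server localhost:8889 --catalog hive --schema default
trino> SHOW CATALOGS;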

How to run glue script from Glue Dev Endpoint

I have a Glue script (test.py) written, say, in an editor. I connected to a Glue dev endpoint and copied the script to the endpoint (or I can store it in an S3 bucket). A Glue endpoint is basically an EMR cluster, so how can I run the script from the dev endpoint terminal? Can I use spark-submit to run it?
I know we can run it from the Glue console, but I am more interested in whether I can run it from the Glue endpoint terminal.
You don't need a notebook; you can ssh to the dev endpoint and run it with the gluepython interpreter (not plain python).
e.g.
radix@localhost:~$ DEV_ENDPOINT=glue@ec2-w-x-y-z.compute-1.amazonaws.com
radix@localhost:~$ scp myscript.py $DEV_ENDPOINT:/home/glue/myscript.py
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT
...
[glue@ip-w-x-y-z ~]$ gluepython myscript.py
You can also run the script directly over ssh, without getting an interactive shell (after uploading the script with scp or whatever):
radix@localhost:~$ ssh -i {private-key} $DEV_ENDPOINT gluepython myscript.py
If this is a script that uses the Job class (as the auto-generated Python scripts do), you may need to pass --JOB_NAME and --TempDir parameters.
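For example (a sketch; the job name and bucket path are placeholders):
[glue@ip-w-x-y-z ~]$ gluepython myscript.py --JOB_NAME my-test-job --TempDir s3://my-temp-bucket/glue-temp/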
For development/testing purposes, you can set up a Zeppelin notebook locally and establish an SSH connection using the AWS Glue endpoint URL, so you have access to the Data Catalog, crawlers, etc., and also to the S3 bucket where your data resides.
After all the testing is completed, you can bundle your code and upload it to an S3 bucket. Then create a job pointing to the ETL script in the S3 bucket, so that the job can be run and scheduled as well (see the sketch below).
Please refer here and to 'setting up Zeppelin on Windows' for help setting up the local environment. You can use the dev instance provided by Glue, but you may incur additional costs for it (EC2 instance charges).
Once you have set up the Zeppelin notebook, you can copy the script (test.py) into the notebook and run it from Zeppelin.
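The job-creation step can also be done from the AWS CLI. A hedged sketch, where the job name, role, and bucket paths are placeholders:
aws glue create-job \
  --name my-etl-job \
  --role MyGlueServiceRole \
  --command Name=glueetl,ScriptLocation=s3://my-bucket/scripts/test.py \
  --default-arguments '{"--TempDir": "s3://my-bucket/glue-temp/"}'
aws glue start-job-run --job-name my-etl-job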
According to the AWS Glue FAQ:
Q: When should I use AWS Glue vs. Amazon EMR?
AWS Glue works on top of the Apache Spark environment to provide a scale-out execution environment for your data transformation jobs. AWS Glue infers, evolves, and monitors your ETL jobs to greatly simplify the process of creating and maintaining jobs. Amazon EMR provides you with direct access to your Hadoop environment, affording you lower-level access and greater flexibility in using tools beyond Spark.
Do you have any specific requirement to run a Glue script on an EMR instance? In my opinion, EMR gives you more flexibility: you can use any third-party Python libraries and run them directly on an EMR Spark cluster.
Regards

How to connect Jenkins and aws dynamoDB

I want to store values from Jenkins environment variables in AWS DynamoDB. Could anyone help me connect Jenkins and DynamoDB, either through manual configuration or using a Jenkins shell command?
Thank you in advance
You can use the AWS CLI for this, launching the command from your Jenkins pipeline code (see the sketch below the link).
More info here:
Using the AWS CLI with DynamoDB
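A minimal sketch of what such a call could look like; the table name and attribute names are placeholders, and the Jenkins agent is assumed to already have the AWS CLI and credentials (for example via an instance profile) configured:
# Run from a Jenkins "Execute shell" build step or a pipeline sh step.
# BUILD_NUMBER and GIT_COMMIT are standard Jenkins environment variables.
aws dynamodb put-item \
  --table-name JenkinsBuildInfo \
  --item '{"BuildId": {"S": "'"$BUILD_NUMBER"'"}, "GitCommit": {"S": "'"$GIT_COMMIT"'"}}'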

How to run a Spark jar file from AWS Console without Spark-Shell

I'm trying to run a Spark application on the AWS EMR Console (Amazon Web Services). My Scala script, compiled into the jar, takes the SparkConf settings as parameters or just strings:
val sparkConf = new SparkConf()
  .setAppName("WikipediaGraphXPageRank")
  .setMaster(args(1)) // master URL read from the second program argument
  .set("spark.executor.memory", "1g")
  .registerKryoClasses(Array(classOf[PRVertex], classOf[PRMessage]))
However, I don't know how to pass the master-URL parameter and other parameters to the jar when it's uploaded and the cluster is set up. To be clear, I'm aware that if I were running the Spark shell I would do this another way, but I'm a Windows user, and with the current setup and the work I've done, it would be very useful to have some way to pass the master URL to the EMR cluster in the 'steps'.
I don't want to use the Spark shell. I have a close deadline, I have everything set up this way, and it feels like this small issue of passing the master URL as a parameter should be solvable, considering that AWS has a guide for running stand-alone Spark applications on EMR.
Help would be appreciated!
Here are instructions on using spark-submit via EMR Step: https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/spark-submit-via-step.md
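The short version is that everything placed after the application jar in the step's arguments is handed to your main method as args, so the master URL can simply be appended there. A hedged sketch (the class name, bucket, jar name, and first argument are placeholders, ordered so that args(1) receives the master; on EMR you would normally pass yarn rather than a standalone master URL):
spark-submit --deploy-mode cluster \
  --class com.example.WikipediaGraphXPageRank \
  s3://mybucket/wikipedia-pagerank.jar \
  s3://mybucket/input/ yarn
When spark-submit is already supplying --master, many people drop setMaster from the code entirely and let the cluster configuration decide.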

How to run/install Oozie on an EMR cluster

I want to orchestrate my EMR jobs, so I thought Oozie would be a good fit. I have done some POCs on Oozie workflows, but only in local mode; there it is fairly simple and works great.
But I don't understand how to use Oozie on an EMR cluster.
Based on some searching, I learned that AWS doesn't ship with Oozie, so we have to install it explicitly as a bootstrap action.
Most people point to this link:
https://github.com/lila/emr-oozie-sample
But since I am new to AWS (EMR), I am still confused about how to use it.
It would be great if anyone could simplify it for me by providing some steps or the like.
Thanks
I had a few questions which I posted to AWS technical support, and I got the reply below. I tried it, and Oozie was installed and running with no extra effort required.
In order to have Oozie installed on an EMR cluster you need to install Hue. The reason is that currently Oozie on EMR is installed as a dependency for Hue. Hue is supported on AMIs 3.3.0 and 3.3.1 as per http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html. After launching an EMR cluster with Hue -> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hue.html installed you should be able to use Oozie immediately as it will be already configured and started.
EMR 4.x and 5.x series releases now come with Oozie as an optional application. There's also been a recent blog post on the AWS Big Data Blog outlining how to get started with it:
https://blogs.aws.amazon.com/bigdata/post/TxZ4KDBGBMZYJL/Use-Apache-Oozie-Workflows-to-Automate-Apache-Spark-Jobs-and-more-on-Amazon-EMR
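On those release labels you can simply select Oozie as an application when creating the cluster. A sketch with the AWS CLI (the release label, key pair, and instance settings are placeholders):
aws emr create-cluster \
  --name "oozie-cluster" \
  --release-label emr-5.36.0 \
  --applications Name=Hadoop Name=Oozie Name=Spark \
  --ec2-attributes KeyName=my-key-pair \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles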
That GitHub project installs Oozie as well, so you don't need to take care of it. The configuration for the Oozie installation is at the next link:
https://github.com/lila/emr-oozie-sample/blob/master/config/config-oozie.sh
After that, there are some tasks you can execute from the command shell: create, ssh, sshproxy, and socksproxy.
So, if you follow its instructions, you only need to run some of these tasks in order to create and execute an EMR task using Oozie.
For those who are interested, I have cloned the repo and updated the Oozie installer script to support Hadoop 2.4.0 and Oozie 4.0.1
https://github.com/davideanastasia/emr-oozie-sample
Firstly, this is not a direct answer to this question.
EMR integrates with Data Pipeline - Amazon's own scheduler and data workflow orchestrator. Amazon expects you to use Data Pipeline with EMR. It can create, start, and terminate EMR clusters, managing the cluster lifecycle, etc. Evaluate it to see whether it fits your needs better.