I am trying to start an EMR cluster with Spark using the CLI, where I specify Spark as an application. I also have some bootstrap scripts that configure things like IPython notebooks on top of Spark. However, when I try to refer to common Spark locations in my bootstrap scripts (/usr/bin/spark or /usr/lib/spark/bin) I get not found errors.
Can someone help me understand what the sequence of events in EMR clusters is -- are applications installed after bootstrapping?
Applications are installed after the bootstrap actions have run, so Spark's usual locations do not exist yet while your bootstrap script is executing. You can't refer to them until Spark itself has been installed.
Bootstrap actions run before Amazon EMR installs the applications that
you specify when you create the cluster and before cluster nodes begin
processing data. If you add nodes to a running cluster, bootstrap
actions also run on those nodes in the same way. You can create custom
bootstrap actions and specify them when you create your cluster.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-bootstrap.html
I need to patch an application (Presto) after it has been installed on all nodes. A few possible solutions are:
Passwordless SSH, but we disabled it because of security concerns.
Schedule a cron job in the bootstrap action that checks whether the application is installed and then acts on it.
Use SSM (AWS Systems Manager), but I have never really tried it.
Any ideas?
[Update]
What we actually did in our case was start a background script (using &) from the bootstrap script, so that it does not block bootstrapping. The background job periodically checks whether the package is installed (e.g. rpm -q presto) and, once it is, applies the patch.
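A minimal sketch of that background-polling approach, in case it helps. The function names (wait_for, patch_when_installed), the retry defaults and the patch script path in the comment are all made up for illustration; only the rpm -q presto check comes from the update above.

```shell
#!/bin/sh
# Sketch of a bootstrap-action helper: poll until a package shows up,
# then run a patch command. All names here are hypothetical.

# wait_for CHECK_CMD MAX_TRIES SLEEP_SECONDS
# Re-runs CHECK_CMD until it succeeds. Returns 1 if it never succeeds
# within MAX_TRIES attempts.
wait_for() {
  check_cmd=$1
  max_tries=$2
  sleep_seconds=$3
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    if sh -c "$check_cmd" >/dev/null 2>&1; then
      return 0
    fi
    tries=$((tries + 1))
    sleep "$sleep_seconds"
  done
  return 1
}

# patch_when_installed CHECK_CMD PATCH_CMD [MAX_TRIES] [SLEEP_SECONDS]
# Runs PATCH_CMD once CHECK_CMD succeeds. Defaults (assumed): 120 tries,
# 30 seconds apart.
patch_when_installed() {
  if wait_for "$1" "${3:-120}" "${4:-30}"; then
    sh -c "$2"
  else
    echo "gave up waiting for: $1" >&2
    return 1
  fi
}

# In the bootstrap script itself, start it in the background so that
# bootstrapping is not blocked (the patch script path is hypothetical):
#   patch_when_installed "rpm -q presto" "sudo /home/hadoop/patch-presto.sh" \
#     >/tmp/patch-presto.log 2>&1 &
```

Redirecting the background job's output to a log file keeps it from holding on to the bootstrap script's stdout and leaves something to inspect if the patch never runs.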
I believe you can use EMR steps to do this. The question "What is the correct syntax for running a bash script as a step in EMR?" has a somewhat relevant description of how to use them.
Update:
You cannot use EMR steps since steps only run on the master.
My understanding is that Dev Endpoints in AWS Glue can be used to develop code iteratively and then deploy it to a Glue job. I find this especially useful when developing Spark jobs, because every time you run a job it takes several minutes to launch a Hadoop cluster in the background. However, I am seeing a discrepancy when using Python shell in Glue instead of Spark: import pg doesn't work in a Dev Endpoint I created using a SageMaker JupyterLab Python notebook, but it works in AWS Glue when I create a job using Python shell. Shouldn't the same libraries exist in the Dev Endpoint that exist in Glue? What is the point of having a Dev Endpoint if you cannot run the same code in both places (the Dev Endpoint and the Glue job)?
Firstly, Python shell jobs do not launch a Hadoop cluster in the backend, because they do not give you a Spark environment for your jobs.
Secondly, since PyGreSQL is not written in pure Python, it will not work with Glue's native environment (Glue Spark job, Dev Endpoint, etc.).
Thirdly, Python shell has built-in support for certain additional packages.
Thus, I don't see a point in using a Dev Endpoint for Python shell jobs.
I have launched a Hadoop EMR cluster (5.5.0, with Hive and Hue as components) but without Sqoop. Now I also need Sqoop to query and dump data from a MySQL database. Since the cluster is already running and holds a good amount of data, I wanted to know whether I can add Sqoop to it. I don't see this option in the AWS Console.
Thanks
I installed it manually and did the required configuration. The limitation, I guess, is that if I have to clone the cluster, Sqoop won't be available on the clone.
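For the cloning concern, one option may be to request Sqoop as an application when the replacement cluster is created, instead of installing it by hand again. The sketch below only prints the command (a dry run) so it can be reviewed first. It assumes that Sqoop is selectable on the release you use, which I believe is the case for emr-5.x but which is worth checking in the release notes. The cluster name, instance type and instance count are placeholders, not values from this thread.

```shell
#!/bin/sh
# Dry-run sketch: prints an `aws emr create-cluster` command instead of
# running it. Assumptions: Sqoop is offered on this release; the name,
# instance type and instance count are illustrative placeholders.
RELEASE_LABEL="emr-5.5.0"   # release mentioned in the question
APPLICATIONS="Name=Hadoop Name=Hive Name=Hue Name=Sqoop"
CLUSTER_NAME="sqoop-cluster"
INSTANCE_TYPE="m4.large"
INSTANCE_COUNT="3"

print_create_cluster_cmd() {
  printf 'aws emr create-cluster --name %s --release-label %s --applications %s --instance-type %s --instance-count %s --use-default-roles\n' \
    "$CLUSTER_NAME" "$RELEASE_LABEL" "$APPLICATIONS" "$INSTANCE_TYPE" "$INSTANCE_COUNT"
}

print_create_cluster_cmd
```

Once the printed command looks right for your account, it can be pasted into a shell that has the AWS CLI configured.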
Is there a way to force Amazon EMR to use Spark 1.0.1? The current selectable versions stop at 1.4.1.
I am using the Alternating Least Squares implementation in MLlib. Since v1.1 it uses weighted regularization, and for specific reasons (a research study) I do not want that implementation. Instead, I am trying to access the non-weighted regularization version that was implemented in v1.0.
I am using Zeppelin notebooks with Scala, if that helps.
Is working with Zeppelin a requirement? Because if so, it could be very difficult. Zeppelin is compiled against a specific version of Spark so downgrading the jar will most likely fail.
Otherwise, if you are OK with not using Zeppelin and using the EMR Step API instead, you might be able to spin up an EMR cluster with a bootstrap action that installs spark-assembly 1.0.1. I say it might work because there is no guarantee that the current EMR version is compatible with a two-year-old version of Spark.
To create the cluster:
Create a cluster from the UI, making sure to uncheck Spark in the additional software menu.
Add a custom bootstrap action that uses the script at s3://support.elasticmapreduce/spark/install-spark with the arguments -v 1.0.1
(See https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark for configuration options)
To run Spark using the EMR Step API:
Upload your compiled jar to S3, then submit a step against that cluster. You will need the following:
Cluster ID: the ID of your cluster (ex j-XXXXXXXX)
Region of the cluster: where you created your EMR cluster (ex us-west-2)
Your Spark main class: this is where you put your ML pipeline code.
Your jar: you have to upload the jar with your code to S3 so that your cluster can download it
arg1, arg2: arguments to your main (optional)
aws emr add-steps --cluster-id <your-cluster-id> --steps \
Name=SparkPi,Jar=s3://<your-region>.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn,--class,com.your.spark.class.MainApp,s3://<your-bucket>/your.jar,arg1,arg2],ActionOnFailure=CONTINUE
(Taken from the official github repo at https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/spark-submit-via-step.md)
Also if that fails, install Hadoop and check out https://spark.apache.org/docs/1.0.1/running-on-yarn.html
Or you could also run 1.0.1 locally on your laptop if your data is small.
Good luck.
Amazon EMR provides a list of supported versions of software packages that you can install by selecting them from a drop-down menu. Nothing stops you from installing additional custom software with a bootstrap action. I had some experience installing Java 8 when EMR supported only Java 7. It is a bit painful but totally possible.
EMR supports Spark 1.6.0. Take a look at their latest release of emr-4.4.0: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-whatsnew.html
I want to orchestrate my EMR jobs, so I thought Oozie would be a good fit. I have done some POCs with Oozie workflows, but only in local mode, where it is fairly simple and works great.
But I don't understand how to use Oozie on an EMR cluster.
Based on some searching, I learned that AWS doesn't come with Oozie, so we have to install it explicitly as a bootstrap action.
Most people point to this link
https://github.com/lila/emr-oozie-sample
But since I am new to AWS (EMR), I am still confused about how to use it.
It would be great if anyone could simplify it for me by providing some steps.
Thanks
I had a question about this, which I posted to AWS technical support, and I got the reply below. I tried it, and Oozie was installed and running with no extra effort required.
In order to have Oozie installed on an EMR cluster you need to install Hue. The reason is that currently Oozie on EMR is installed as a dependency for Hue. Hue is supported on AMIs 3.3.0 and 3.3.1 as per http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html. After launching an EMR cluster with Hue installed (http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hue.html), you should be able to use Oozie immediately, as it will already be configured and started.
EMR 4.x and 5.x series releases now come with Oozie as an optional application. There's also been a recent blog post on the AWS Big Data Blog outlining how to get started with it:
https://blogs.aws.amazon.com/bigdata/post/TxZ4KDBGBMZYJL/Use-Apache-Oozie-Workflows-to-Automate-Apache-Spark-Jobs-and-more-on-Amazon-EMR
That GitHub project installs Oozie as well, so you don't need to take care of it. The configuration for the Oozie installation is at the following link:
https://github.com/lila/emr-oozie-sample/blob/master/config/config-oozie.sh
After that, there are some tasks you can execute from the command shell:
create:
ssh:
sshproxy:
socksproxy:
So, if you follow his instructions, you only need to run some of these tasks in order to create and execute an EMR task using Oozie.
For those who are interested, I have cloned the repo and updated the Oozie installer script to support Hadoop 2.4.0 and Oozie 4.0.1
https://github.com/davideanastasia/emr-oozie-sample
Firstly, this is not a direct answer to this question.
EMR integrates with Data Pipeline, Amazon's own scheduler and data workflow orchestrator. Amazon expects you to use Data Pipeline with EMR. It can create, start, and terminate EMR clusters and manage the cluster lifecycle. Evaluate it to see whether it fits your needs better.