Spark not installed on EMR cluster - amazon-web-services

I have been using Spark on an EMR cluster for a few weeks now without problems - the setup was with the AMI 3.8.0 and Spark 1.3.1, and I passed '-x' as an argument to Spark (without this it didn't seem to be installed).
I want to upgrade to a more recent version of Spark and today spun up a cluster with the emr-4.1.0 AMI, containing Spark 1.5.0. When the cluster is up it claims to have successfully installed Spark (at least on the cluster management page on AWS) but when I ssh into 'hadoop#[IP address]' I don't see anything in the 'hadoop' directory, where in the previous version Spark was installed (I've also tried with other applications and had the same result, and tried to ssh in as ec2-user but Spark is also not installed there). When I spin up the cluster with the emr-4.1.0 AMI I don't have the option to pass the '-x' argument to Spark, and I'm wondering if there is something I'm missing.
Does anyone know what I'm doing wrong here?
Many thanks.

This was actually solved, rather trivially.
In the previous AMI all of the paths to Spark and other applications were soft links available in the hadoop folder. In the newer AMI these have been removed but the applications are still installed and can be accessed by 'spark-shell' (for example) at the command line.

Related

Manual installation of Sqoop 1.4 on AWS EMR 5.5.0

I have launched a hadoop EMR Cluster (5.5.0 - components - Hive, Hue) but not SQOOP. But now i need to have sqoop also to query and dump data from mysql database. Since the cluster is already launched with good amount of data wanted to know if i can also add Sqoop. I dont see this option on AWS Console.
Thanks
I installed it manually, done the required configuration. The limitation i guess now would be that if i have to clone the cluster then it wont be available.

Script to set up Hadoop on EC2

The script to setup Hadoop on EC2 as described in https://wiki.apache.org/hadoop/AmazonEC2 has been removed from recent hadoop release. Google points me to an alternative http://whirr.apache.org/ which also has been retired for more than a year. Is there a replacement or alternative which is still good to set up the latest version of Hadoop on EC2? Thank you!
Update
hadoop-ec2 script has been removed from hadoop src as on 01/11/2011. The intention is to replace it by Apache Whirr. It would be great if the removal could be explicitly documented. Unfortunately, early changelogs are no longer conveniently available on Hadoop official website.
Rather than installing and maintaining Hadoop yourself on an Amazon EC2 instance, you could consider using the Amazon EMR service.
Amazon EMR can automatically deploy a Hadoop cluster and can be triggered via the Management Console, an API call or the AWS Command-Line Interface (CLI).

AWS EMR Spark 1.0

Is there a way to force Amazon EMR to use Spark 1.0.1? The current selectable versions stop at 1.4.1.
I am using the Alternating Least Squares implementation within MLlib, and since v1.1 they have implemented weighted regularization and for specific reasons (research study) I do not want this implementation, rather I am trying to access the non-weighted regularization version they had implemented in v1.0.
I am using Zepplin notebooks with Scala if that helps.
Is working with Zeppelin a requirement? Because if so, it could be very difficult. Zeppelin is compiled against a specific version of Spark so downgrading the jar will most likely fail.
Otherwise, if you are ok with not using Zeppelin and instead using the EMR Step API, then you might be able to spin up an EMR cluster with a bootstrap action that installs spark-assembly 1.0.1. I said it might work, because there's no guarantee that the current EMR version is compatible with a 2 year old version of Spark.
To create the cluster:
Create a cluster from the UI, make sure to uncheck Spark from the additional software menu
Add a custom bootstrap action and use the script at s3://support.elasticmapreduce/spark/install-spark with arguments -v 1.0.1
(See https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark for configuration options)
To run spark using the EMR Step API:
Upload your compiled jar to s3, then submit a step against that cluster
Cluster ID: the id of your cluster (ex j-XXXXXXXX)
Region of cluster. Where you created your EMR cluster. Ex us-west-2
Your spark main class: This is where you put your ml pipeline code.
Your jar: you have to upload the jar with your code to S3 so your cluster can download it
arg1, arg2: arguments to your main (optional)
aws emr add-steps --cluster-id --steps \
Name=SparkPi,Jar=s3://.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn,--class,com.your.spark.class.MainApp,s3://>/your.jar,arg1,arg2],ActionOnFailure=CONTINUE
(Taken from the official github repo at https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/spark-submit-via-step.md)
Also if that fails, install Hadoop and check out https://spark.apache.org/docs/1.0.1/running-on-yarn.html
Or you could also run 1.0.1 locally on your laptop if your data is small.
Good luck.
Amazon EMR provide a list of supported versions of software packages you can install by selecting a drop menu. Nothing stop you from installing additional custom software with a bootstrap action. I had some experience installing java 8 when EMR was supporting only Java 7. It is a bit painful but totally possible.
EMR supports Spark 1.6.0. Take a look at their latest release of emr-4.4.0: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-whatsnew.html

Installing Impala 2.3 on Amazon EMR

I see that Impala 2.3 is only supported on Cloudera CDH 5.5 & above. Impala 2.2 can be installed on Amazon EMR as there is Bootstrap script available on GitHub & you don't require Cloudera installation.
However, I don't see any way to install Cloudera CDH 5.5 or 5.6 on Amazon EMR. I want to install Impala 2.3 so is there any way through which Impala 2.3 can be installed on Amazon EMR?
Well, my previous answer has been deleted as long as "does not provide an answer to the question". I'm not going to argue if it's better to have a partially incorrect answer to this question or if making categorical claims without foundation is a good answer :/.
In any case, I'm not giving up :)
Yes, it's possible to install "anything" on the paper.
Once you launch the EMR cluster, all instances will appear on your EC2 console. The only thing is that you have to be careful assigning the right permissions to access thru SSH to your instances. My suggestion is to create a specific security group with the access and assign this extra security group to the instances using the Advanced configuration of the cluster.
By having the proper configuration, you could ssh into any instance and install anything (you should be able to scp any file or download from internet if you have the proper configuration of your VPC). Note that the user will be "hadoop" instead "ec2-root" but this is documented on the EMR user guide.
Keep in mind that the cluster is "Terminated" so, the EMR instances are volatile and the installation is not going to survive the cluster termination.
On the other hand, using the latest versions of EMR AMIs and the latest capabilities of AWS (I think that it was all the time the case, but, it doesn't matter now) you should be able to create some actions on the bootstrap and install anything you want.
Using the "Advanced configuration" of your cluster, you can access to the "Bootstrap" actions to be executed on your cluster. You could even have different actions depending on the node type (master, core, tasks). You should store your scripts (and/or jar files) on an S3 bucket and made this bucket available to your cluster. On the paper, you could install Impala on these EC2 instances comprising the EMR cluster but I'm not sure if this will work.
For more information, you can read http://docs.aws.amazon.com//emr/latest/ManagementGuide/emr-plan-bootstrap.html
And for a previous version of EMR AMI and not so recent version of Impala you can read https://github.com/awslabs/emr-bootstrap-actions/tree/master/impala
Thanks Mark, you forced me to elaborate better my comment.
No, it is not possible to "install" anything on EMR because it's a PaaS provided by AWS. But if your goal is to run a newer version of Impala on AWS, there is an AWS Quick Start path for installing CDH 5.x (including Impala) that makes the process relatively easy.
http://aws.amazon.com/quickstart/

how to run/install oozie in EMR cluster

I want to orchestrate my EMR jobs. so I thought oozie will be good fit. I have done some POCs on oozie workflow but in local mode, its fairly simple and great.
But I dont understand how to use oozie on EMR cluster.
Based on some search I got to know that aws doesnt come with oozie so we have install it explicitly as a bootstrap action.
Most people point to this link
https://github.com/lila/emr-oozie-sample
But since I am new to aws(EMR) I am still confused how to use it.
It will be great, If anyone can simplify it for me providing some steps or something.
Thanks
I have had some question, which i posted to AWS technical support and i got below reply. I tried it and Oozie is all installed and running with no extra efforts required.
In order to have Oozie installed on an EMR cluster you need to install Hue. The reason is that currently Oozie on EMR is installed as a dependency for Hue. Hue is supported on AMIs 3.3.0 and 3.3.1 as per http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html. After launching an EMR cluster with Hue -> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hue.html installed you should be able to use Oozie immediately as it will be already configured and started.
EMR 4.x and 5.x series releases now come with Oozie as an optional application. There's also been a recent blog post on the AWS Big Data Blog outlining how to get started with it:
https://blogs.aws.amazon.com/bigdata/post/TxZ4KDBGBMZYJL/Use-Apache-Oozie-Workflows-to-Automate-Apache-Spark-Jobs-and-more-on-Amazon-EMR
That github project installs Oozie as well, so you don't need to take care of it. The configuration for the Oozie installation is in the next link:
https://github.com/lila/emr-oozie-sample/blob/master/config/config-oozie.sh
After that, there are some tasks you can execute from the command shell:
create:
ssh:
sshproxy:
socksproxy:
So, if you follow his instructions you only need to run some of this tasks in order to create and execute an EMR task using Oozie.
For those who are interested, I have cloned the repo and updated the Oozie installer script to support Hadoop 2.4.0 and Oozie 4.0.1
https://github.com/davideanastasia/emr-oozie-sample
Firstly, this is not a direct answer to this question.
EMR integrates with Data Pipeline - Amazon's own scheduler and data workflow orchestrator. Amazon expects you to use Data Pipeline with EMR. It can create, start and terminate EMR clusters, managing cluster lifecycle etc. Evaluate that to see if that fits your needs better..