Script to set up Hadoop on EC2 - amazon-web-services

The script to set up Hadoop on EC2, as described in https://wiki.apache.org/hadoop/AmazonEC2, has been removed from recent Hadoop releases. Google points me to an alternative, http://whirr.apache.org/, which has itself been retired for more than a year. Is there a replacement or alternative that still works for setting up the latest version of Hadoop on EC2? Thank you!
Update
The hadoop-ec2 script was removed from the Hadoop source as of 01/11/2011. The intention was to replace it with Apache Whirr. It would be great if the removal were explicitly documented. Unfortunately, early changelogs are no longer conveniently available on the official Hadoop website.

Rather than installing and maintaining Hadoop yourself on an Amazon EC2 instance, you could consider using the Amazon EMR service.
Amazon EMR can automatically deploy a Hadoop cluster and can be triggered via the Management Console, an API call or the AWS Command-Line Interface (CLI).
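As a rough sketch of the CLI route, the snippet below only assembles an `aws emr create-cluster` command line; it does not run it. The cluster name, release label, instance type and instance count are placeholder assumptions, so substitute values that are valid for your account and region.

```python
import shlex


def emr_create_cluster_command(name, release_label,
                               instance_type="m5.xlarge", instance_count=3):
    """Assemble (but do not run) an `aws emr create-cluster` command.

    All argument values are illustrative assumptions, not recommendations.
    """
    return [
        "aws", "emr", "create-cluster",
        "--name", name,
        "--release-label", release_label,
        "--applications", "Name=Hadoop",
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
        "--use-default-roles",
    ]


if __name__ == "__main__":
    # "hadoop-test" and "emr-5.36.0" are placeholders for this example.
    print(shlex.join(emr_create_cluster_command("hadoop-test", "emr-5.36.0")))
```

The printed line can be pasted into a shell once the AWS CLI is configured with credentials.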

Related

AWS EC2 API CLI tools vs EC2 Command line tool CLI which one is latest

On my Ubuntu Linux box, I have two sets of tools installed for AWS EC2.
1. AWS CLI v2, installed at /usr/local/aws-cli/2.4.0
These provide the unified AWS CLI interface that everyone is used to: aws ec2 <subcmd>
2. EC2 API command line tools, at /usr/bin/ec2-version
These tools seem to be mere shell script wrappers around Java invocations of the ec2 commands.
The actual Java classes seem to be located in a jar, ec2-api-tools-1.6.14.1.jar.
The man page of ec2-version also shows a version of 1.6.14.1, api=2014-05-01.
I am trying to write some automation scripts and would like to know which of these is still supported by AWS. I understand that the first option has two variants, AWS CLI v1 and AWS CLI v2, and that CLI v1 is deprecated.
Is the second option, ec2-api-tools (Java), also deprecated by Amazon, given that its latest version seems to date from around 2014?
Which of these tools should I use for my automation?
The AWS Command-Line Interface (CLI) is continuously updated. You should use it.
I do not recognise ec2-api-tools, so I would recommend that you do not use them.
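For automation built on the AWS CLI, one possible pattern is to request JSON output and parse it in a script. The sketch below assembles an `aws ec2 describe-instances` command and extracts instance IDs from its JSON output. The region and the sample document are made up for illustration; only the `Reservations` / `Instances` / `InstanceId` nesting is taken from the real response shape.

```python
import json


def describe_running_instances_command(region):
    """Assemble (but do not run) a describe-instances call for running instances."""
    return [
        "aws", "ec2", "describe-instances",
        "--region", region,
        "--filters", "Name=instance-state-name,Values=running",
        "--output", "json",
    ]


def instance_ids(describe_output):
    """Pull every InstanceId out of describe-instances JSON text."""
    data = json.loads(describe_output)
    return [
        instance["InstanceId"]
        for reservation in data.get("Reservations", [])
        for instance in reservation.get("Instances", [])
    ]


if __name__ == "__main__":
    # Hypothetical sample output; the instance ID is invented for the example.
    sample = '{"Reservations": [{"Instances": [{"InstanceId": "i-0123456789abcdef0"}]}]}'
    print(" ".join(describe_running_instances_command("us-west-2")))
    print(instance_ids(sample))
```

In a real script the command list would be passed to `subprocess.run(..., capture_output=True, text=True)` and its stdout handed to `instance_ids`.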

Set Spark version for Sagemaker on Glue Dev Endpoint

To create my Glue scripts, I use development endpoints with Sagemaker notebooks that run the Pyspark (Sparkmagic) kernel.
The latest version of Glue (version 1.0) supports Spark 2.4. However, my Sagemaker notebook uses Spark version 2.2.1.
The function I want to test only exists as of Spark 2.3.
Is there a way to solve this mismatch between the dev endpoint and the Glue job? Can I somehow set the Spark version of the notebook?
I couldn't find anything in the documentation.
When you create a SageMaker notebook for the Glue dev endpoint, it launches a SageMaker notebook instance with a specific lifecycle configuration. This LC provides the configurations to create a connection between the SageMaker notebook and the development endpoint. Upon running cells from the PySpark kernel, the code is sent to the Livy server running in the development endpoint via REST APIs.
Thus, the PySpark version that you see and on which the SageMaker notebook runs depends on the development endpoint and is not configurable from the SageMaker point of view.
Since Glue is a managed service, root access to the development endpoint is restricted, so you cannot update Spark to a later version. Support for Spark 2.4 was only recently introduced in Glue, and it seems it has not yet been released for dev endpoints.
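To make the notebook-to-Livy path described above more concrete, here is a minimal sketch of the kind of REST request involved. Sparkmagic does this for you; the Livy URL and session id below are hypothetical, and the request is only built, not sent. Running `spark.version` in a notebook cell is a simple way to see which Spark version the endpoint actually runs.

```python
import json
import urllib.request


def livy_statement_request(livy_url, session_id, code):
    """Build (but do not send) a POST to Livy's /sessions/{id}/statements."""
    body = json.dumps({"code": code}).encode("utf-8")
    return urllib.request.Request(
        url=f"{livy_url.rstrip('/')}/sessions/{session_id}/statements",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # The URL and session id 0 are assumptions for this example only.
    req = livy_statement_request("http://localhost:8998", 0, "spark.version")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Because the code executes on the endpoint rather than on the notebook instance, changing packages on the notebook side does not change the Spark version that runs the cell.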

AWS EMR Spark 1.0

Is there a way to force Amazon EMR to use Spark 1.0.1? The current selectable versions stop at 1.4.1.
I am using the Alternating Least Squares implementation in MLlib. Since v1.1 it has used weighted regularization, and for specific reasons (a research study) I do not want that; I am trying to access the non-weighted regularization version that was implemented in v1.0.
I am using Zeppelin notebooks with Scala, if that helps.
Is working with Zeppelin a requirement? Because if so, it could be very difficult. Zeppelin is compiled against a specific version of Spark so downgrading the jar will most likely fail.
Otherwise, if you are OK with not using Zeppelin and using the EMR Step API instead, you might be able to spin up an EMR cluster with a bootstrap action that installs spark-assembly 1.0.1. I say "might" because there is no guarantee that the current EMR version is compatible with a two-year-old version of Spark.
To create the cluster:
Create a cluster from the UI, make sure to uncheck Spark from the additional software menu
Add a custom bootstrap action and use the script at s3://support.elasticmapreduce/spark/install-spark with arguments -v 1.0.1
(See https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark for configuration options)
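If you prefer the AWS CLI to the UI for the cluster creation above, the bootstrap action is passed through the `--bootstrap-actions` option of `aws emr create-cluster`. The helper below is a small sketch that only formats that option's value in the CLI's shorthand syntax; the function name is made up for this example, and the rest of the create-cluster command (AMI or release version, instance settings) is left out because it depends on your setup.

```python
def bootstrap_action_arg(path, args=()):
    """Format one value for the AWS CLI `--bootstrap-actions` option.

    Uses the CLI shorthand form: Path=<s3-uri>,Args=[a,b,...]
    """
    value = f"Path={path}"
    if args:
        value += ",Args=[" + ",".join(args) + "]"
    return value


if __name__ == "__main__":
    # Script path and arguments are the ones quoted in the steps above.
    print(bootstrap_action_arg(
        "s3://support.elasticmapreduce/spark/install-spark",
        ["-v", "1.0.1"],
    ))
```

Whether that install script still works on a current EMR release is not guaranteed, as noted above.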
To run spark using the EMR Step API:
Upload your compiled jar to s3, then submit a step against that cluster
Cluster ID: the id of your cluster (ex j-XXXXXXXX)
Region of cluster. Where you created your EMR cluster. Ex us-west-2
Your spark main class: This is where you put your ml pipeline code.
Your jar: you have to upload the jar with your code to S3 so your cluster can download it
arg1, arg2: arguments to your main (optional)
aws emr add-steps --cluster-id <cluster-id> --steps \
Name=SparkPi,Jar=s3://<region>.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn,--class,com.your.spark.class.MainApp,s3://<your-bucket>/your.jar,arg1,arg2],ActionOnFailure=CONTINUE
(Taken from the official github repo at https://github.com/awslabs/emr-bootstrap-actions/blob/master/spark/examples/spark-submit-via-step.md)
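When the step is submitted from a script, it can help to generate the `--steps` value rather than edit it by hand. The sketch below rebuilds the command above from its parts. It assumes the region-prefixed script-runner location (`s3://<region>.elasticmapreduce/...`); the function names, bucket, class and cluster id are placeholders, and nothing is executed.

```python
def spark_submit_step(name, region, main_class, jar_s3_uri, app_args=()):
    """Format one `--steps` value that runs spark-submit via script-runner."""
    step_args = [
        "/home/hadoop/spark/bin/spark-submit",
        "--deploy-mode", "cluster",
        "--master", "yarn",
        "--class", main_class,
        jar_s3_uri,
        *app_args,
    ]
    return (
        f"Name={name},"
        f"Jar=s3://{region}.elasticmapreduce/libs/script-runner/script-runner.jar,"
        f"Args=[{','.join(step_args)}],"
        "ActionOnFailure=CONTINUE"
    )


def add_steps_command(cluster_id, step):
    """Assemble (but do not run) the `aws emr add-steps` command."""
    return ["aws", "emr", "add-steps", "--cluster-id", cluster_id, "--steps", step]


if __name__ == "__main__":
    # All values here are placeholders for illustration.
    step = spark_submit_step("SparkPi", "us-west-2",
                             "com.your.spark.class.MainApp",
                             "s3://my-bucket/your.jar", ["arg1", "arg2"])
    print(" ".join(add_steps_command("j-XXXXXXXX", step)))
```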
Also if that fails, install Hadoop and check out https://spark.apache.org/docs/1.0.1/running-on-yarn.html
Or you could also run 1.0.1 locally on your laptop if your data is small.
Good luck.
Amazon EMR provides a list of supported software package versions that you can select from a drop-down menu, but nothing stops you from installing additional custom software with a bootstrap action. I had some experience installing Java 8 when EMR supported only Java 7. It is a bit painful but entirely possible.
EMR supports Spark 1.6.0. Take a look at their latest release of emr-4.4.0: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-whatsnew.html

Installing Impala 2.3 on Amazon EMR

I see that Impala 2.3 is only supported on Cloudera CDH 5.5 and above. Impala 2.2 can be installed on Amazon EMR, as there is a bootstrap script available on GitHub and you don't need a Cloudera installation.
However, I don't see any way to install Cloudera CDH 5.5 or 5.6 on Amazon EMR. Is there any way to install Impala 2.3 on Amazon EMR?
Well, my previous answer was deleted because it "does not provide an answer to the question". I'm not going to argue about whether it's better to have a partially incorrect answer to this question, or whether making categorical claims without foundation makes a good answer :/.
In any case, I'm not giving up :)
Yes, on paper it's possible to install "anything".
Once you launch the EMR cluster, all of its instances appear in your EC2 console. You just have to be careful to assign the right permissions for SSH access to your instances. My suggestion is to create a dedicated security group that allows this access and attach it to the instances as an additional security group, using the Advanced configuration of the cluster.
With the proper configuration, you can ssh into any instance and install anything (you should be able to scp any file, or download from the internet if your VPC is configured for it). Note that the user will be "hadoop" instead of "ec2-user", but this is documented in the EMR user guide.
Keep in mind that clusters get terminated, so EMR instances are volatile and anything you install by hand will not survive cluster termination.
On the other hand, with the latest versions of the EMR AMIs and the latest capabilities of AWS (I think this was always the case, but it doesn't matter now) you should be able to define bootstrap actions that install anything you want.
Using the "Advanced configuration" of your cluster, you can set up the "Bootstrap" actions to be executed on your cluster. You can even have different actions depending on the node type (master, core, task). You should store your scripts (and/or jar files) in an S3 bucket and make this bucket accessible to your cluster. On paper, you could install Impala on the EC2 instances that make up the EMR cluster, but I'm not sure whether this will work.
For more information, you can read http://docs.aws.amazon.com//emr/latest/ManagementGuide/emr-plan-bootstrap.html
And for a previous version of EMR AMI and not so recent version of Impala you can read https://github.com/awslabs/emr-bootstrap-actions/tree/master/impala
Thanks Mark, you pushed me to elaborate on my comment.
No, it is not possible to "install" anything on EMR because it's a PaaS provided by AWS. But if your goal is to run a newer version of Impala on AWS, there is an AWS Quick Start path for installing CDH 5.x (including Impala) that makes the process relatively easy.
http://aws.amazon.com/quickstart/

how to run/install oozie in EMR cluster

I want to orchestrate my EMR jobs, so I thought Oozie would be a good fit. I have done some POCs on Oozie workflows, but only in local mode, where it is fairly simple and works great.
But I don't understand how to use Oozie on an EMR cluster.
From some searching, I gather that EMR doesn't come with Oozie, so we have to install it explicitly as a bootstrap action.
Most people point to this link
https://github.com/lila/emr-oozie-sample
But since I am new to AWS (EMR), I am still confused about how to use it.
It would be great if anyone could simplify it for me by providing some steps.
Thanks
I had the same question, which I posted to AWS technical support, and I got the reply below. I tried it, and Oozie was installed and running with no extra effort required.
In order to have Oozie installed on an EMR cluster you need to install Hue. The reason is that currently Oozie on EMR is installed as a dependency for Hue. Hue is supported on AMIs 3.3.0 and 3.3.1 as per http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/ami-versions-supported.html. After launching an EMR cluster with Hue -> http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hue.html installed you should be able to use Oozie immediately as it will be already configured and started.
EMR 4.x and 5.x series releases now come with Oozie as an optional application. There's also been a recent blog post on the AWS Big Data Blog outlining how to get started with it:
https://blogs.aws.amazon.com/bigdata/post/TxZ4KDBGBMZYJL/Use-Apache-Oozie-Workflows-to-Automate-Apache-Spark-Jobs-and-more-on-Amazon-EMR
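For the API route, selecting Oozie comes down to listing it under the applications of the cluster request. The sketch below builds the keyword arguments one might pass to boto3's `run_job_flow`; it does not import boto3 or call AWS. The function name, cluster name, release label and instance settings are assumptions for illustration, and the role names are the EMR defaults.

```python
def oozie_cluster_request(name, release_label,
                          instance_type="m5.xlarge", instance_count=3):
    """Build keyword arguments for an EMR run_job_flow call with Oozie enabled.

    Values are illustrative; adjust them to your account before use.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Hadoop"}, {"Name": "Oozie"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Keep the cluster up so workflows can be submitted later.
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


if __name__ == "__main__":
    # "oozie-test" and "emr-5.36.0" are placeholders for this example.
    request = oozie_cluster_request("oozie-test", "emr-5.36.0")
    print([app["Name"] for app in request["Applications"]])
```

With boto3 installed and credentials configured, the dictionary would be used as `boto3.client("emr").run_job_flow(**request)`.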
That github project installs Oozie as well, so you don't need to take care of it. The configuration for the Oozie installation is in the next link:
https://github.com/lila/emr-oozie-sample/blob/master/config/config-oozie.sh
After that, there are some tasks you can execute from the command shell:
create:
ssh:
sshproxy:
socksproxy:
So, if you follow the author's instructions, you only need to run some of these tasks in order to create and execute an EMR task using Oozie.
For those who are interested, I have cloned the repo and updated the Oozie installer script to support Hadoop 2.4.0 and Oozie 4.0.1
https://github.com/davideanastasia/emr-oozie-sample
Firstly, this is not a direct answer to this question.
EMR integrates with Data Pipeline, Amazon's own scheduler and data workflow orchestrator. Amazon expects you to use Data Pipeline with EMR. It can create, start and terminate EMR clusters, manage the cluster lifecycle, and so on. Evaluate it to see whether it fits your needs better.