How to install Hadoop 3 on AWS EMR?

Hadoop 3 is already 15 months old, yet the official EMR release labels still support only Hadoop 2.
I couldn't find any quick documentation on how to set up Hadoop 3.1.2 on EMR. Are most people not using it? It seems more difficult than it should be; what am I missing?

EMR did come out with official support for Hadoop 3.1 in September, as part of the EMR 6.0.0 beta release.[1]
It also includes support for Amazon Linux 2 and Amazon Corretto JDK 8.
[1]EMR6-beta: https://aws.amazon.com/about-aws/whats-new/2019/09/simplify-your-spark-application-dependency-management-with-docker-and-hadoop-3-with-emr-6-0-0-beta/
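To try the beta, you can launch a cluster with the beta release label from the AWS CLI. A minimal sketch (the cluster name and instance settings here are placeholders, and the exact release label string should be checked against the announcement):

aws emr create-cluster \
  --name "hadoop3-test" \
  --release-label emr-6.0.0-beta \
  --applications Name=Hadoop Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles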

Related

Problems Integrating Hadoop 3.x on Flink cluster

I am facing some issues while trying to integrate Hadoop 3.x on a Flink cluster. My goal is to use HDFS as persistent storage and to store checkpoints there. I am currently using Flink 1.13.1 and HDFS 3.3.1. The error I get when trying to submit a job is that HDFS is not supported as a file system. In the standalone version, this error was solved by specifying HADOOP_CLASSPATH on my local machine. As a next step, I applied that solution on all the machines used in my cluster, and in standalone mode I managed to successfully submit my jobs on all of them without any issues. However, when I started modifying the configurations to set up my cluster (by specifying the IPs of my machines), the problem came up once again. What am I missing?
For Hadoop 2.x there are pre-bundled jar files on the official Flink download page that solved similar issues in the past, but that's not the case for Hadoop 3.x versions.
It should be enough to set HADOOP_CLASSPATH on every machine in the cluster.
For anyone still struggling with a similar issue, the answer proposed by David worked for me in the end. The detail I was missing was in the definition of the environment variables.
In my initial attempts, I was using the .bashrc script to permanently define my environment variables. This works in a standalone cluster, but not in a distributed cluster, due to the scope of the script. What actually worked for me was defining my variables (including $HADOOP_CLASSPATH) in /etc/profile.
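Concretely, the line in /etc/profile is something like the following (a sketch, assuming the hadoop binary is on the PATH of every machine; this is the classpath export the Flink docs suggest):

# in /etc/profile, so it applies to all login sessions, unlike ~/.bashrc
export HADOOP_CLASSPATH=$(hadoop classpath)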
I also managed to find another solution while I was struggling with HADOOP_CLASSPATH. As I mentioned in my initial post, for Hadoop 2.x there are pre-bundled jar files on the official Flink download page to support HDFS integration, which is not the case for Hadoop 3.x. I found the following Maven repository page, and after testing the existing jars I managed to find one that worked in my case. To be more precise, for Hadoop 3.3.1 the 3.1.1.7.2.8.0-224-9.0 jar worked (placed in $FLINK_HOME/lib). While it is not an "official" solution, it seems to solve the issue.
https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-hadoop-3-uber?repo=cloudera-repos
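For anyone who wants to reproduce this, the jar just needs to land in Flink's lib directory. A sketch, where the download URL is an assumption based on the standard Maven repository layout of the Cloudera repo:

cd $FLINK_HOME/lib
wget https://repository.cloudera.com/artifactory/cloudera-repos/org/apache/flink/flink-shaded-hadoop-3-uber/3.1.1.7.2.8.0-224-9.0/flink-shaded-hadoop-3-uber-3.1.1.7.2.8.0-224-9.0.jar
# restart the Flink cluster afterwards so the new jar is picked up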

What is the Scala and Java version for AWS Glue ETL job?

So far I'm using Scala 2.11 with Java 8 to build the library used by the Glue ETL job. We're planning to upgrade to Scala 2.12 with Java 11, but I'm not sure whether they are supported by Glue ETL.
The Glue versions are listed here. The latest version supports Spark 2.4.3.
In Spark 2.4.3, the default version of Scala is 2.11.
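So a library built for Glue should pin those versions. A minimal build.sbt sketch (the project name is a placeholder, and marking Spark as "provided" assumes the Glue runtime supplies it):

// build.sbt: match the Scala/Spark versions of the Glue runtime
name := "glue-etl-lib"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.3" % "provided"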

AWS Elastic Beanstalk Python 3.7 Deployment Location

I'm trying to upgrade an existing application on AWS from the now-deprecated Python 3.4 platform to Python 3.7 on Amazon Linux 2/3.0.1, and in the process I ran into an issue with where the application source code is deployed on the EC2 instance.
From some empirical testing, I found that instead of the /opt/python/current/app directory that most if not all AWS documentation mentions (e.g. Troubleshooting issues with the EB CLI - AWS Elastic Beanstalk), with Python 3.7 it is actually deployed to /var/app/current/. I wasn't able to find any documentation about this change, and it is causing some issues with the application. I'm wondering: is there any reason this change was made? And if it is possible to revert it, how would I do so?
Thanks in advance!
This is because the Python 3.7 Elastic Beanstalk platform uses Amazon Linux 2, which is fundamentally different from its Amazon Linux AMI predecessor. If you opt to use Python 3.6 instead, you should be able to avoid this issue, as it runs on the earlier Linux version where deployments still land in /opt/python/current/app. Most tutorials I've found are designed to work with this older layout, including the most up-to-date Amazon getting started guide.
If you have the time, try migrating your code to the newer layout, as this seems to be the workflow Amazon is embracing going forward for all newer versions of Python (such as 3.8 and others yet to come).
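If you have scripts that must work on both platform generations during the migration, a small guard like this can help (a sketch; APP_DIR is just an illustrative variable name):

# resolve the deployment directory on either platform generation
if [ -d /var/app/current ]; then
  APP_DIR=/var/app/current          # Amazon Linux 2 platforms
else
  APP_DIR=/opt/python/current/app   # legacy Amazon Linux AMI platforms
fi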

Location of Sqoop installation on Amazon EMR cluster?

I started an EMR cluster in order to test out Sqoop, but it doesn't seem to be installed on the latest version of EMR (5.19.0), as I didn't find it in the directory /usr/lib/sqoop. I tried 5.18.0 as well, but it was missing there too.
According to the application versions page, Sqoop 1.4.7 should be installed on the cluster.
The EMR console gives me a list of four "installations". I chose the Core Hadoop package. It has Hive, Hue, etc. installed in /usr/lib. Am I missing something here? It's my first time using EMR or Sqoop.
I had not seen the "Advanced Options" link at the top of the "Create Cluster" page, where I can select individual software to install.
When creating an EMR cluster, use the "Advanced Options" link, which allows you to select Sqoop.
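The same thing can be done from the CLI by naming Sqoop explicitly in the applications list; a sketch (the instance settings are placeholders):

aws emr create-cluster \
  --release-label emr-5.19.0 \
  --applications Name=Hadoop Name=Hive Name=Hue Name=Sqoop \
  --instance-type m4.large \
  --instance-count 3 \
  --use-default-roles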

Error while installing (bootstrapping) latest Spark on latest AWS EMR (5.13.X)

I have been trying to install Spark on the latest EMR (5.13.x) cluster via bootstrapping, using the following with Terraform, but without success. Is there any ready-to-use bootstrap script for the latest Spark/EMR version, or another way to do this with Terraform?
bootstrap_action = {
  path = "s3://support.elasticmapreduce/spark/install-spark"
  name = "install-spark"
  args = ["instance.isMaster=true", "echo running on master node"]
}
That install-spark bootstrap action hasn't worked since before Spark was officially supported as an application on AMI version 3.9.0 about three years ago. Also, bootstrap actions built for AMI version 3.x and earlier do not work at all with release labels emr-4.x and emr-5.x+.
Instead, to install Spark on emr-4.x or emr-5.x, you simply include "Spark" in the list of Applications of the RunJobFlowRequest.
I have not used Terraform to create an EMR cluster, but the example I found at https://www.terraform.io/docs/providers/aws/r/emr_cluster.html shows exactly how to create a cluster with Spark.
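Based on that example, the key part is the applications argument of the aws_emr_cluster resource. A minimal sketch (the names, instance types, and IAM role references are placeholders that must exist in your own configuration):

resource "aws_emr_cluster" "cluster" {
  name          = "spark-cluster"
  release_label = "emr-5.13.0"
  applications  = ["Spark"]

  ec2_attributes {
    instance_profile = "${aws_iam_instance_profile.emr_profile.arn}"
  }

  master_instance_type = "m4.large"
  core_instance_type   = "m4.large"
  core_instance_count  = 1
  service_role         = "${aws_iam_role.iam_emr_service_role.arn}"
}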