I started an EMR cluster in order to test out Sqoop, but it doesn't seem to be installed on the latest version of EMR (5.19.0): I didn't find it in the directory /usr/lib/sqoop. I tried 5.18.0 as well, but it was missing there too.
According to the application versions page, Sqoop 1.4.7 should be installed on the cluster.
The EMR console gives me a list of four "installations". I chose the Core Hadoop package. It has Hive, Hue, etc. installed in /usr/lib. Am I missing something here? It's my first time using EMR or Sqoop.
I did not see the "Advanced Options" link at the top of the "Create Cluster" page where I can select individual software to install.
When creating an EMR cluster, use the Advanced Options link, which allows you to select Sqoop.
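If you prefer the CLI, you can also request Sqoop explicitly when creating the cluster. A rough sketch (instance type, count, and key name below are placeholders):

aws emr create-cluster \
  --name "sqoop-test" \
  --release-label emr-5.19.0 \
  --applications Name=Hadoop Name=Hive Name=Hue Name=Sqoop \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key \
  --use-default-roles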
I am facing some issues while trying to integrate Hadoop 3.x with a Flink cluster. My goal is to use HDFS as persistent storage and to store checkpoints there. I am currently using Flink 1.13.1 and HDFS 3.3.1. The error I get when trying to submit a job is that HDFS is not supported as a file system. In the standalone setup, this error was solved by specifying HADOOP_CLASSPATH on my local machine. As a next step I applied that solution on all the machines used in my cluster, and in standalone mode I managed to submit my jobs successfully on all of them without any issues. However, when I started modifying the configuration to set up my cluster (by specifying the IPs of my machines), the problem came up again. What am I missing?
For Hadoop 2.x there are pre-bundled jar files on the official Flink download page that would have solved similar issues in the past, but that is not the case for the Hadoop 3.x versions.
It should be enough to set HADOOP_CLASSPATH on every machine in the cluster.
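Concretely, that means the user that starts the Flink processes on each node should have something like this in its environment (assuming the hadoop binary is on the PATH; otherwise use its absolute path):

export HADOOP_CLASSPATH=$(hadoop classpath)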
For anyone still struggling with a similar issue, the answer proposed by David worked for me in the end. The detail that I was missing was in the definition of the environment variables.
In my initial attempts I was using the .bashrc script to permanently define my environment variables. That works for the standalone cluster, but not for a distributed cluster, because of the scope of the script. What actually worked for me was defining my variables (including $HADOOP_CLASSPATH) in /etc/profile.
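As a sketch, the relevant lines in /etc/profile on each node could look like this (the Hadoop install path is just an example, adjust it to your layout):

# /etc/profile -- read by login shells on every node
export HADOOP_HOME=/opt/hadoop-3.3.1
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CLASSPATH=$($HADOOP_HOME/bin/hadoop classpath)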
I also managed to find another solution while I was struggling with HADOOP_CLASSPATH. As I mentioned in my initial post, for Hadoop 2.x there are pre-bundled jar files on the official Flink download page to support HDFS integration, which is not the case for Hadoop 3.x. I found the Maven repository page below, and after testing the existing jars I managed to find one that worked in my case. To be more precise, for Hadoop 3.3.1 the 3.1.1.7.2.8.0-224-9.0 jar (placed in $FLINK_HOME/lib) worked. While it is not an "official" solution, it seems to solve the issue.
https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-hadoop-3-uber?repo=cloudera-repos
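For completeness, the only step after downloading that jar was placing it next to the other Flink jars and restarting the cluster, roughly like this (the download location is a placeholder):

cp ~/Downloads/flink-shaded-hadoop-3-uber-3.1.1.7.2.8.0-224-9.0.jar $FLINK_HOME/lib/
$FLINK_HOME/bin/stop-cluster.sh && $FLINK_HOME/bin/start-cluster.sh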
We have updated our EMR version to emr-5.30.0. Since then we are getting an error during bootstrap:
"Terminated with bootstrap error"
If I change the version back to emr-5.29.0 it works fine. I am not able to find the reason for the bootstrap error.
We are creating the EMR cluster from a Step Function.
We changed the version from emr-5.29.0 to emr-5.30.0 because we are adding managed autoscaling, which is only supported in versions after 5.29.0.
I checked the logs but could not find any proper error message. Please suggest some pointers for troubleshooting this.
An EMR version change affects many things, including the versions of the applications you choose to include, as @Snighdhajyoti mentioned; for example, in emr-5.29.0 Spark was at version 2.4.4 and in emr-5.30.0 Spark is at version 2.4.5. You can see the basic list of application changes here.
But the point is, there can be some application or package that you install or configure manually in your bootstrap script that conflicts with one of the updated packages.
As for logs, bootstrap logs don't appear in the cluster logs but in the stderr logs for your bootstrap action, like below:
s3://doc-example-bucket/cluster-id/node/instance-id/bootstrap-actions/
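You can list and pull those logs down with the AWS CLI; for example (bucket, cluster ID, instance ID, and the exact file names are placeholders):

aws s3 ls s3://doc-example-bucket/cluster-id/node/instance-id/bootstrap-actions/ --recursive
aws s3 cp s3://doc-example-bucket/cluster-id/node/instance-id/bootstrap-actions/1/stderr.gz - | gunzip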
This link provides some more guidance on how you can dig into the error, for example:
If you can't determine why the script failed after reviewing the stderr logs, modify your script to provide additional debug information. For example, set the -ex parameters in the bash script. This allows you to view the bash script flow in the bootstrap action log files.
Note: If the failed bootstrap action isn't a bootstrap action that you created (for example, if you created six bootstrap actions and the error message is "bootstrap action 7 failed with non-zero exit code"), it indicates that Amazon EMR couldn't install applications or start services. This problem is rare. To resolve this issue, try launching the cluster again.
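If you follow the set -ex suggestion above, the change is just at the top of your bootstrap script; a minimal sketch:

#!/bin/bash
# -e: exit on the first failing command, -x: echo each command as it runs
set -ex
# ... rest of the bootstrap logic ...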
Is there a way to do it right from a cell in the notebook, similar to pip install ... --upgrade?
I didn't know how to do what's instructed at https://docs.qubole.com/en/latest/faqs/general-questions/install-custom-python-libraries.html#pre-installed-python-libraries
The current Python version is 3.5.3 and Pandas is 0.20.1. I need to upgrade Pandas and Matplotlib.
In Qubole there are two ways to upgrade/install a package for the Python environment. Currently there is no interface available inside the notebook to install new packages.
New and recommended way (via Package Management): you can enable the Package Management functionality for an account and add new packages to a cluster via the UI. There are a lot of advantages to using Package Management over the bootstrap approach in terms of performance and usability. Refer to https://docs.qubole.com/en/latest/user-guide/package-management/index.html for further details.
Old way (via bootstrap): you can configure a node bootstrap, which is basically a shell script executed on each node when the cluster starts or upscales (i.e., when more nodes are added to the cluster). This can be configured via the Clusters UI and requires a cluster restart for every change. This is what is instructed in the link you shared.
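For the bootstrap route, the script for the upgrade in question could be as small as the following (a rough sketch; the pip binary on your cluster may live inside a virtualenv, so adjust the path accordingly):

#!/bin/bash
# node bootstrap: runs on every node when the cluster starts or upscales
pip install --upgrade pandas matplotlib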
You cannot download/upgrade packages directly from a cell in the notebook. This is because your notebook is associated with a cluster. To ensure that all the nodes of the cluster have the package installed, you must either use Package Management (https://docs.qubole.com/en/latest/user-guide/package-management/package-management-environment.html) or the cluster's node bootstrap (https://docs.qubole.com/en/latest/user-guide/clusters/run-scripts-cluster.html#examples-node-scripts).
Do let me know if you have any further questions.
There is a similar question here - Spark not installed on EMR cluster
But what I am trying to understand is this: before the 4.x AMI versions there was a .versions folder on the AWS EMR cluster, e.g. ".versions/2.4.0-amzn-7/etc/hadoop", and there were also Spark installation folders under /home/hadoop.
Now everything is under /etc/, like /etc/hadoop/conf.
Is there any particular reason behind this change? Basically I have a custom bootstrap, and I used /home/hadoop previously, so do I now need to shift to /etc/?
Thanks!
Please see the documentation for the emr-4.x releases, in particular this page that details these kinds of differences from prior versions: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-release-differences.html
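In practice that mostly means updating hard-coded paths in your bootstrap script; for example, a script that used to read configuration from under /home/hadoop would now point at /etc/hadoop/conf instead (a sketch, not a complete bootstrap):

# emr-4.x and later keep the Hadoop configuration under /etc/hadoop/conf
cat /etc/hadoop/conf/core-site.xml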
I am using the Zeppelin sandbox with AWS EMR.
Is there a way to download or save the Zeppelin notebook so that it can be imported into another Zeppelin server?
As noted in the comments above, this feature is available starting in version 0.5.6. You can find more details in the release notes. Downloading and installing this version would solve that issue.
Given that you are using EMR, it looks like you will have to work with the version available. As Samuel mentioned above, you can back up the contents of the incubator-zeppelin/notebook folder and make the transfer.
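A rough sketch of that transfer (the directories and host below are placeholders; on EMR the Zeppelin install directory may differ):

# on the source machine: archive the notebook storage directory
tar czf notebooks.tar.gz -C /path/to/incubator-zeppelin notebook
# copy to the target server and unpack into its Zeppelin directory
scp notebooks.tar.gz user@target-host:/tmp/
ssh user@target-host 'tar xzf /tmp/notebooks.tar.gz -C /path/to/zeppelin'
# then restart Zeppelin on the target so it picks up the imported notes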