Zeppelin: how to download or save the Zeppelin notebook? - amazon-web-services

I am using the Zeppelin sandbox with AWS EMR.
Is there a way to download or save a Zeppelin notebook so that it can be imported into another Zeppelin server?

As noted in the comments above, this feature is available starting in version 0.5.6. You can find more details in the release notes. Downloading and installing this version would solve that issue.
Given that you are using EMR, it looks like you will have to work with the version available there. As Samuel mentioned above, you can back up the contents of the incubator-zeppelin/notebook folder and transfer them to the other server.
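The folder backup described above could be sketched as a pair of shell functions. This is only a rough sketch, not an official Zeppelin procedure: the function names are made up for illustration, the paths in the usage lines are placeholders, and it assumes tar is available on both servers. The target Zeppelin server will probably need a restart before it lists the copied notes.

```shell
# backup_notebooks SRC_DIR ARCHIVE
# Pack the notebook folder (e.g. incubator-zeppelin/notebook) into a tar.gz.
backup_notebooks() {
    src_dir="$1"
    archive="$2"
    [ -d "$src_dir" ] || { echo "no such directory: $src_dir" >&2; return 1; }
    # -C keeps only the folder name itself inside the archive, not the full path.
    tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}

# restore_notebooks ARCHIVE DEST_PARENT
# Unpack the archive under the Zeppelin home directory of the other server.
restore_notebooks() {
    archive="$1"
    dest_parent="$2"
    mkdir -p "$dest_parent" && tar -xzf "$archive" -C "$dest_parent"
}
```

Usage would look like `backup_notebooks /path/to/incubator-zeppelin/notebook notebooks.tar.gz` on the EMR node, then copying the archive across (for example with scp) and running `restore_notebooks notebooks.tar.gz /path/to/incubator-zeppelin` on the other server.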

Related

What are the exact steps for setting up JFrog Artifactory Pro on AWS?

I am looking for the exact steps for setting up JFrog Artifactory Pro on AWS and then accessing it from the browser (browser access must only be possible from inside the corporate network). I am following the steps from
https://www.devopsschool.com/blog/artifactory-install-and-configurations-guide/
https://github.com/ravdy/DevOps/blob/master/Artifactory/Setup_Artifactory.md
Do I need to set up a reverse proxy? If so, the steps for doing that would be helpful too.
I am very new to AWS, JFrog Artifactory and reverse proxies (one week of experience with all of them), so I have not been able to find the right resource to get this done.
The best approach is to follow the official installation steps provided here: https://www.jfrog.com/confluence/display/JFROG/Installing+Artifactory
If you need an air-gapped environment, I suggest downloading the Linux Archive installation into the network once and installing from that. If Docker is a possibility, download the images locally and install using Docker Compose.

Problems Integrating Hadoop 3.x on Flink cluster

I am facing some issues while trying to integrate Hadoop 3.x with a Flink cluster. My goal is to use HDFS as persistent storage for checkpoints. I am currently using Flink 1.13.1 and HDFS 3.3.1. The error I get when submitting a job is that HDFS is not supported as a file system. In the standalone version, this error was solved by specifying HADOOP_CLASSPATH on my local machine. As a next step, I applied the same solution on all the machines in my cluster, and in the standalone version I managed to submit my jobs on all of them without any issues. However, when I started modifying the configuration to set up my cluster (by specifying the IPs of my machines), the problem came up again. What am I missing?
For Hadoop 2.x there are pre-bundled jar files on the official Flink download page that solved similar issues in the past, but that is not the case for Hadoop 3.x versions.
It should be enough to set HADOOP_CLASSPATH on every machine in the cluster.
For anyone still struggling with a similar issue, the answer proposed by David worked for me in the end. The detail that I was missing was in the definition of the environment variables.
In my initial attempts, I was using the .bashrc script to permanently define my environment variables. This works for the standalone setup but not for a distributed cluster, due to the scope of the script. What actually worked for me was defining my variables (including $HADOOP_CLASSPATH) in /etc/profile.
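For illustration, the lines added to /etc/profile on each machine might look roughly like the sketch below. The /opt/hadoop location is a hypothetical install path, not something taken from the posts above; the `hadoop classpath` command is the usual way to let Hadoop print its own classpath. Treat this as a starting point under those assumptions rather than the exact configuration that was used.

```shell
# Sketch of lines for /etc/profile on every machine in the Flink cluster.
# /opt/hadoop is a hypothetical install location; adjust it to your nodes.
HADOOP_HOME="${HADOOP_HOME:-/opt/hadoop}"
export HADOOP_HOME
export PATH="$PATH:$HADOOP_HOME/bin"

# Let the hadoop launcher print its own classpath, if it is installed.
if command -v hadoop >/dev/null 2>&1; then
    HADOOP_CLASSPATH="$(hadoop classpath)"
    export HADOOP_CLASSPATH
fi
```

After editing /etc/profile, the Flink processes have to be restarted from a fresh login session so that they actually see the new variables.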
I also found another solution while I was struggling with HADOOP_CLASSPATH. As I mentioned in my initial post, for Hadoop 2.x there are pre-bundled jar files on the official Flink download page to support HDFS integration, which is not the case for Hadoop 3.x. I found the following Maven repository page, and after testing all of the existing jars I found one that worked in my case. To be more precise, for Hadoop 3.3.1 the 3.1.1.7.2.8.0-224-9.0 jar worked (I placed the jar in $FLINK_HOME/lib). While it is not an "official solution", it seems to solve the issue.
https://mvnrepository.com/artifact/org.apache.flink/flink-shaded-hadoop-3-uber?repo=cloudera-repos

How do I upgrade a library in Qubole's Jupyter Notebook, using PySpark?

Is there a way to do it right from a cell in the notebook, similar to pip install ... --upgrade?
I couldn't work out how to follow the instructions at https://docs.qubole.com/en/latest/faqs/general-questions/install-custom-python-libraries.html#pre-installed-python-libraries
The current Python version is 3.5.3, with Pandas 0.20.1. I need to upgrade Pandas and Matplotlib.
In Qubole there are two ways to upgrade or install a package for the Python environment. Currently there is no interface available inside the notebook to install new packages.
New and Recommended Way (via Package Management): A user can enable the Package Management functionality for an account and add new packages to a cluster via the UI. Package management has a lot of advantages over cluster versions in terms of performance and usability. Refer to https://docs.qubole.com/en/latest/user-guide/package-management/index.html for further details.
Old Way (via bootstrap): A user can configure a bootstrap, which is basically a shell script executed on each node when the cluster starts or upscales (more nodes are added to the cluster). This can be configured via the clusters UI and requires a cluster restart for every change. This is what the link you shared describes.
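As a rough idea of what such a bootstrap script could contain, here is a minimal sketch. It assumes that the pip found on the node's PATH belongs to the Python environment the notebook uses, which may well not be true on Qubole; the linked documentation describes how to select the right environment. The PIP variable and the upgrade_packages helper are made up for this illustration.

```shell
#!/bin/bash
# Sketch of a node bootstrap script; it runs on every node at cluster start.

# upgrade_packages PKG...
# Upgrade the given packages with pip. PIP is a hypothetical override for
# pointing at the pip of a specific Python environment.
upgrade_packages() {
    "${PIP:-pip}" install --upgrade "$@"
}

# In the real bootstrap file, end with a call such as:
#   upgrade_packages pandas matplotlib
```

Keeping the package list in one call makes it easy to see what each node gets when the cluster is restarted.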
You cannot download or upgrade packages directly from a cell in the notebook, because your notebook is associated with a cluster. To ensure that all the nodes of the cluster have the package installed, you must use either package management (https://docs.qubole.com/en/latest/user-guide/package-management/package-management-environment.html) or the cluster's node bootstrap (https://docs.qubole.com/en/latest/user-guide/clusters/run-scripts-cluster.html#examples-node-scripts).
Do let me know if you have any further questions.

Location of Sqoop installation on Amazon EMR cluster?

I started an EMR cluster in order to test out Sqoop, but it doesn't seem to be installed on the latest version of EMR (5.19.0), as I didn't find it in the directory /usr/lib/sqoop. I tried 5.18.0 as well, but it was missing there too.
According to the application versions page, Sqoop 1.4.7 should be installed on the cluster.
The EMR console gives me a list of 4 "installations". I chose the Core Hadoop package. It has Hive, Hue, etc. installed in /usr/lib. Am I missing something here? It's my first time using EMR or Sqoop.
I did not see the "Advanced Options" link at the top of the "Create Cluster" page, where I can select individual software to install.
When creating an EMR cluster, use the Advanced Options link; it lets you select Sqoop.
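If you prefer the command line over the console, the same selection can probably be made with the AWS CLI by listing Sqoop among the applications. The sketch below is an assumption-laden example, not part of the original answer: the cluster name, instance type and instance count are placeholders, the AWS_CLI variable and function name are invented for illustration, and --use-default-roles assumes the default EMR roles already exist in the account.

```shell
# create_sqoop_cluster
# Launch an EMR cluster with Sqoop selected alongside Hadoop and Hive.
# AWS_CLI is a hypothetical override for the aws executable.
create_sqoop_cluster() {
    "${AWS_CLI:-aws}" emr create-cluster \
        --name "sqoop-test" \
        --release-label emr-5.19.0 \
        --applications Name=Hadoop Name=Hive Name=Sqoop \
        --instance-type m4.large \
        --instance-count 3 \
        --use-default-roles
}
```

Once the cluster is up, /usr/lib/sqoop should be present on the master node, as the question expected.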

How to setup and use Kafka-Connect-HDFS in HDP 2.4

I want to use kafka-connect-hdfs on Hortonworks HDP 2.4. Can you please help me with the steps I need to follow to set it up in an HDP environment?
Other than building Kafka Connect HDFS from source, you can download and extract Confluent Platform's TAR.GZ files on your Hadoop nodes. That doesn't mean you are "installing Confluent".
Then you can cd /path/to/confluent-x.y.z/
And run Kafka Connect from there.
./bin/connect-standalone ./etc/kafka/connect-standalone.properties ./etc/kafka-connect-hdfs/quickstart-hdfs.properties
If that is working for you, then in order to run connect-distributed (the recommended way to run Kafka Connect), you need to download the same thing on the rest of the machines you want to run Kafka Connect on.
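In distributed mode the connector is not passed as a properties file on the command line; it is submitted to a worker's REST API instead. The sketch below shows one way this could look. The helper names, the CONNECT_URL variable and the connector values are illustrative (they mirror the kind of settings found in a quickstart properties file), and port 8083 is assumed to be the worker's REST port.

```shell
# Start a worker on each machine, from the extracted Confluent directory:
#   ./bin/connect-distributed ./etc/kafka/connect-distributed.properties

# hdfs_sink_payload TOPIC HDFS_URL
# Print an HDFS sink connector definition as JSON (values are illustrative).
hdfs_sink_payload() {
    cat <<EOF
{
  "name": "hdfs-sink",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": "1",
    "topics": "$1",
    "hdfs.url": "$2",
    "flush.size": "3"
  }
}
EOF
}

# submit_hdfs_sink TOPIC HDFS_URL
# POST the definition to a worker; CONNECT_URL is a hypothetical override.
submit_hdfs_sink() {
    hdfs_sink_payload "$1" "$2" |
        curl -s -X POST -H "Content-Type: application/json" \
             --data @- "${CONNECT_URL:-http://localhost:8083}/connectors"
}
```

For example, `submit_hdfs_sink test_hdfs hdfs://namenode:8020` would register the sink on whichever worker CONNECT_URL points at; the topic name and NameNode address here are placeholders.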