Using Jupyter notebook on Spark on EMR

I am new to Spark and AWS. I am trying to install Jupyter on my Spark cluster (EMR), but in the end I am not able to open Jupyter Notebook in my browser.
Context: I have firewall issues from the place where I work, so I can't reach the IP addresses of the EMR clusters I create on a day-to-day basis. I have a dedicated EC2 instance (its IP address is whitelisted) that I use as a client to connect to the EMR clusters I create as needed.
I have access to the IP address of the EC2 instance and the ports 22 and 8080.
I do not have access to the IP address of the EMR cluster.
Here are the steps I am following:
1. Open PuTTY and connect to the EC2 instance.
2. Establish a connection between my EC2 instance and the EMR cluster:
ssh -i publickey.pem ec2-user@<EMR cluster hostname>
3. Install Jupyter on the Spark cluster using the following command:
pip install jupyter
4. Connect to Spark:
PYSPARK_DRIVER_PYTHON=/usr/local/bin/jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777" pyspark --packages com.databricks:spark-csv_2.10:1.1.0 --master spark://127.0.0.1:7077 --executor-memory 6400M --driver-memory 6400M
5. Establish a tunnel to the browser:
ssh -L 0.0.0.0:8080:127.0.0.1:7777 ip-172-31-34-209 -i publickey.pem
6. Open Jupyter in the browser:
http://<EMR cluster hostname>:8080
I am able to run the first five steps, but I am not able to open the Jupyter notebook in my browser.

I didn't test it, as that involves setting up a test EMR server, but here's what should work:
Step 5:
ssh -i publickey.pem -L 8080:127.0.0.1:7777 HOSTNAME
Step 6:
Open the Jupyter notebook in your browser at 127.0.0.1:8080
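For completeness, an untested end-to-end sketch of the same idea, given that the browser can only reach the whitelisted EC2 instance (the hostnames, user names, and key file are placeholders, and it assumes the EC2 security group allows inbound 8080 from your machine):
# 1. From your laptop, SSH to the whitelisted EC2 instance (port 22 is open)
ssh -i publickey.pem ec2-user@<EC2 public IP>
# 2. On the EC2 instance, forward its port 8080 to the notebook port on the EMR master,
#    binding on all interfaces so your laptop can reach it
ssh -i publickey.pem -N -L 0.0.0.0:8080:127.0.0.1:7777 hadoop@<EMR master hostname>
# 3. In your laptop's browser, open http://<EC2 public IP>:8080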

You can use an EMR notebook with Amazon EMR clusters running Apache Spark to remotely run queries and code. An EMR notebook is a "serverless" Jupyter notebook: it sits outside the cluster and handles attaching to the cluster so you don't have to worry about it.
More information here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks.html

Related

How to start jupyter notebook on AWS

I am a beginner with Amazon EC2 and recently managed to SSH into an EC2 instance successfully. Yet when I tried to start Jupyter over the SSH session:
jupyter notebook --no-browser --port=8888
I get the message:
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://localhost:8888/?token=????????????????????
I copied the URL as instructed into the browser (Chrome and Safari), but it did not work. How can I reach the Jupyter notebook over SSH? Thanks!
I hope you didn't just copy the link as it is (localhost); it is running on EC2, not on your computer. So change the server name to the IP address of your EC2 instance (assuming you allowed traffic on the correct ports).
There are a few guides on how to access Jupyter notebooks on remote servers, e.g. see
https://jupyter-notebook.readthedocs.io/en/stable/public_server.html#notebook-public-server
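Alternatively, a common pattern is to keep the notebook bound to localhost and tunnel to it over SSH. A minimal sketch, assuming a key file and user name that match your instance (placeholders here):
# forward local port 8888 to the notebook running on the instance
ssh -i mykey.pem -N -L 8888:localhost:8888 ec2-user@<EC2 public IP>
# then the token URL works unchanged, because localhost:8888 now points at the instance:
# http://localhost:8888/?token=...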
If you are just playing around and don't care about security in this case, you can update the binding IP in your jupyter_notebook_config.py:
c.NotebookApp.ip = '*'
You can start the Jupyter server using the following command:
jupyter notebook --ip='*'
If you want to keep it running even after the terminal is closed, then use:
nohup jupyter notebook --ip='*' > nohup_jupyter.out &
Remember to open the port 8888 in the AWS EC2 security group inbound to Anywhere (0.0.0.0/0, ::/0)
Then you can access Jupyter using http://<EC2 public IP>:8888
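Putting the pieces together, a minimal sketch (the security-group ID is hypothetical; binding to all interfaces is insecure and for experiments only):
# generate ~/.jupyter/jupyter_notebook_config.py if you don't have one
jupyter notebook --generate-config
# in that file, set: c.NotebookApp.ip = '*'
# open port 8888 in the instance's security group (hypothetical group ID)
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 8888 --cidr 0.0.0.0/0
# start the server and keep it running after logout
nohup jupyter notebook --ip='*' > nohup_jupyter.out &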
Hope this helps.

How to open a Jupyter notebook on browser from AWS instance?

I am trying to open, in my local browser, a Jupyter notebook started on an AWS instance.
I am using an m4.2xlarge instance from ami-d36386aa in the Ireland region.
I then typed at the instance's command line:
jupyter notebook --no-browser --ip=xx.xx.xx.xx
where the IP is the public IP from AWS. Running the Jupyter notebook gives the output:
http://ec2-xx-xxx-x-xxx.eu-west-1.compute.amazonaws.com:8888/?token=....
I copied that output into my browser, but nothing happens.
Any suggestions?
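A hedged guess at the cause: an EC2 instance's network interface only holds its private address (the public IP is NAT-mapped), so binding to the public IP fails. Two sketches, with placeholder key and hostname:
# on the instance: bind to all interfaces instead of the public IP
jupyter notebook --no-browser --ip=0.0.0.0
# or, from your laptop: tunnel and keep the URL on localhost
ssh -i mykey.pem -N -L 8888:localhost:8888 ubuntu@ec2-xx-xxx-x-xxx.eu-west-1.compute.amazonaws.com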

How to connect EMR Cluster to EC2 server

I use Spark to parallelise computation tasks. To do so, my project is connected to a server that produces some data I need in order to start my Spark job.
Now I would like to migrate my project to the AWS cloud.
I have my Spark app on EMR and my server on EC2. How can I make my EMR Spark app issue HTTP requests to my EC2 server? Do I need something like a gateway?
Thanks,
Have a nice day.
Your EMR cluster actually runs on EC2 servers. You can always SSH into those servers, and from an EMR EC2 server you can certainly SSH to another EC2 server.
In my experience, you should use ssh hadoop@ec2-###-##-##-###.compute-1.amazonaws.com -i /path/mykeypair.pem instead of ssh -i /path/mykeypair.pem -ND 8157 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com. The second command gives no response.
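For the HTTP part of the question (rather than SSH), a hedged sketch: allow the EMR nodes' security group to reach the EC2 server, then call the server by its private IP. The group IDs, port, and address below are hypothetical:
# allow the EMR nodes' security group to reach the EC2 server on port 8080
aws ec2 authorize-security-group-ingress --group-id sg-0ec2server0000001 --protocol tcp --port 8080 --source-group sg-0emrnodes0000002
# from any EMR node, call the server over its private IP
curl http://172.31.xx.xx:8080/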

How may I connect Google Dataproc cluster from Sparklyr?

I'm new to Spark and GCP. I've tried to connect to it with
sc <- spark_connect(master = "IP address")
but it obviously couldn't work (e.g. there is no authentication).
How should I do that? Is it possible to connect to it from outside Google Cloud?
There are two issues with connecting to Spark on Dataproc from outside the cluster: configuration and network access. It is generally somewhat difficult and not fully supported, so I would recommend using sparklyr inside the cluster.
Configuration
Google Cloud Dataproc runs Spark on Hadoop YARN, so you actually need to use yarn-client:
sc <- spark_connect(master = 'yarn-client')
However, you also need a yarn-site.xml in your $SPARK_HOME directory to point Spark at the right hostname.
Network Access
While you can open ports to your IP address using firewall rules on your Google Compute Engine network, it's not considered a good security practice. You would also need to configure YARN to use the instance's external IP address or have a way to resolve hostnames on your machine.
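For illustration only (and discouraged, as noted above), opening a YARN port to a single IP might look like the following; the rule name, port, and address are placeholders:
# allow your IP to reach the YARN ResourceManager RPC port (one of several ports you would need)
gcloud compute firewall-rules create allow-yarn-from-my-ip --allow=tcp:8032 --source-ranges=<your public IP>/32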
Using sparklyr on Dataproc
sparklyr can be installed and run from the R REPL by SSHing into the master node and running:
$ # Needed for the curl library
$ sudo apt-get install -y libcurl4-openssl-dev
$ R
> install.packages('sparklyr')
> library(sparklyr)
> sc <- spark_connect(master = 'yarn-client')
I believe RStudio Server supports SOCKS proxies, which can be set up as described here, but I am not very familiar with RStudio.
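As a hedged sketch of the SOCKS route (the cluster name and zone are placeholders), you can open a dynamic tunnel to the master node with gcloud and point your browser or client at it:
# dynamic (SOCKS) port forwarding to the Dataproc master node
gcloud compute ssh <cluster-name>-m --zone=<zone> -- -D 1080 -N
# then configure the browser/client to use the SOCKS proxy at localhost:1080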
I use Apache Zeppelin on Dataproc for R notebooks, but it autoloads SparkR, which I don't think plays well with sparklyr at this time.

Amazon EMR Tunneling Zeppelin and Jupyter Notebook

I am running Spark EMR on Amazon EC2 and I am trying to tunnel Jupyter Notebook and Zeppelin, so I can access them locally.
I tried running the below command with no success:
ssh -i ~/user.pem -ND 8157 hadoop@ec2-XX-XX-XXX-XX.compute-1.amazonaws.com
What exactly is tunnelling and how can I set it up so I can use Jupyter Notebook and Zeppelin on EMR?
Is there a way to I set up a basic configuration to make this work?
Many thanks.
Application ports like 8890, for Zeppelin on the master node, are not exposed outside of the cluster. So, if you are trying to access the notebook from your laptop, it will not work. SSH tunneling is a way to access these ports via SSH, securely. You are missing at least one step outlined in Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding. Specifically, "After the tunnel is active, configure a SOCKS proxy for your browser."
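A minimal sketch of the two steps (the hostname is a placeholder; port 8157 follows the AWS guide):
# 1. open a dynamic (SOCKS) tunnel to the EMR master node
ssh -i ~/user.pem -N -D 8157 hadoop@ec2-XX-XX-XXX-XX.compute-1.amazonaws.com
# 2. configure the browser to use the SOCKS proxy at localhost:8157
#    (e.g. with FoxyProxy, as in the AWS guide), then browse to the master node, e.g.
#    http://ec2-XX-XX-XXX-XX.compute-1.amazonaws.com:8890 for Zeppelin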