Amazon EMR Tunneling Zeppelin and Jupyter Notebook - amazon-web-services

I am running Spark EMR on Amazon EC2 and I am trying to tunnel Jupyter Notebook and Zeppelin, so I can access them locally.
I tried running the below command with no success:
ssh -i ~/user.pem -ND 8157 hadoop@ec2-XX-XX-XXX-XX.compute-1.amazonaws.com
What exactly is tunnelling and how can I set it up so I can use Jupyter Notebook and Zeppelin on EMR?
Is there a basic configuration I can set up to make this work?
Many thanks.

Application ports such as 8890 (Zeppelin on the master node) are not exposed outside the cluster, so you cannot reach the notebook directly from your laptop. SSH tunneling is a way to reach these ports securely over an SSH connection. With -N and -D, ssh prints nothing and just keeps a SOCKS proxy listening on local port 8157 while it runs, so silence from the command is expected. You are missing at least one step outlined in "Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding". Specifically, "After the tunnel is active, configure a SOCKS proxy for your browser."
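The two steps can be sketched in Python as below. This is only an illustration that assembles command lines: the helper names are hypothetical, the key path, user and port 8157 are taken from the question, and curl's --socks5-hostname option stands in for a browser that has been pointed at the SOCKS proxy.

```python
def dynamic_tunnel_cmd(key_path, master_dns, socks_port=8157, user="hadoop"):
    """Build the ssh command that opens a SOCKS proxy on localhost:socks_port."""
    return [
        "ssh", "-i", key_path,
        "-N",                   # run no remote command, only forward
        "-D", str(socks_port),  # dynamic (SOCKS) port forwarding
        f"{user}@{master_dns}",
    ]


def curl_via_socks_cmd(master_dns, app_port=8890, socks_port=8157):
    """Build a curl command that fetches an application page through the proxy."""
    return [
        "curl", "--socks5-hostname", f"localhost:{socks_port}",
        f"http://{master_dns}:{app_port}/",
    ]


# Placeholder host, as in the question; substitute the master node's public DNS.
host = "ec2-XX-XX-XXX-XX.compute-1.amazonaws.com"
print(" ".join(dynamic_tunnel_cmd("~/user.pem", host)))  # run this first, leave it open
print(" ".join(curl_via_socks_cmd(host)))                # then run this in a second terminal
```

The printed lines are meant to be pasted into a shell. If the curl command returns the Zeppelin page, the tunnel works and only the browser proxy setting remains.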

Related

Jenkins not connecting to AWS EC2 instance via SSH

I am trying to connect to an EC2 instance from Jenkins via SSH. I always get failure in the end. I am storing the SSH key in a global credential.
This is the task and shell, using SSH agent plugin
This is how I store the key (the whole key has been pasted in)
If I am using SSH connection from my local PC, everything is fine. I am a newbie in Jenkins so this is very chaotic for me.
You need to use the SSH plugin. Install it from Manage Jenkins and configure the EC2 instance under SSH remote hosts.
Follow the steps in this link:
https://www.thesunflowerlab.com/blog/jenkins-aws-ec2-instance-ssh/

How to remotely connect to GCP ML Engine/AWS Sagemaker managed notebooks?

GCP has finally released managed Jupyter notebooks. I would like to be able to interact with the notebook locally by connecting to it, i.e. I use PyCharm to connect to the externally configured Jupyter notebook server by passing its URL and token parameter.
Question also applies to AWS Sagemaker notebooks.
AWS does not natively support SSH-ing into SageMaker notebook instances, but nothing really prevents you from setting up SSH yourself.
The only problem is that these instances do not get a public IP address, which means you have to either create a reverse proxy (with ngrok, for example) or connect via a bastion box.
Steps to make the ngrok solution work:
download ngrok with curl https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip > ngrok.zip
unzip ngrok.zip
create ngrok free account to get permissions for tcp tunnels
run ./ngrok authtoken <your_token> to authenticate with your token
start with ./ngrok tcp 22 > ngrok.log & (& will put it in the background)
logfile will contain the url so you know where to connect to
create ~/.ssh/authorized_keys file (on SageMaker) and paste your public key (likely ~/.ssh/id_rsa.pub from your computer)
ssh by calling ssh -p <port_from_ngrok_logfile> ec2-user@0.tcp.ngrok.com (or whatever host they assign to you; it's going to be in ngrok.log)
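Reading the host and port out of the log can be scripted. The sketch below is a rough illustration that assumes the log contains a URL of the form tcp://host:port somewhere in its text; the exact log layout is not shown in this answer, so the sample line and the function names are hypothetical.

```python
import re

# Matches the first tcp://host:port occurrence anywhere in the text.
TCP_URL = re.compile(r"tcp://([A-Za-z0-9.-]+):(\d+)")


def ngrok_endpoint(log_text):
    """Return (host, port) for the first tcp:// URL in log_text, or None."""
    match = TCP_URL.search(log_text)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def ssh_cmd(host, port, user="ec2-user"):
    """Build the ssh command for the endpoint that ngrok assigned."""
    return ["ssh", "-p", str(port), f"{user}@{host}"]


# Hypothetical log line, for illustration only.
sample = "msg=started tunnel url=tcp://0.tcp.ngrok.com:12345"
endpoint = ngrok_endpoint(sample)
if endpoint is not None:
    print(" ".join(ssh_cmd(*endpoint)))
```

In practice you would pass the contents of ngrok.log instead of the sample string.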
If you want to automate it, I suggest using lifecycle configuration scripts.
Another good trick is wrapping downloading, unzipping, authenticating and starting ngrok into some binary in /usr/bin so you can just call it from SageMaker console if it dies.
It's a little bit too long to explain completely how to automate it with lifecycle scripts, but I've written a detailed guide on https://biasandvariance.com/sagemaker-ssh-setup/.
On AWS, you can use AWS Glue to create a developer endpoint, and then you create the Sagemaker notebook from there. A developer endpoint gives you access to connect to your python or Scala spark REPL via ssh, and it also allows you to tunnel the connection and access from any other tool, including PyCharm.
For PyCharm professional we have even tighter integration, allowing you to SFTP files and debug remotely.
And if you need to install any dependencies on the notebook, apart from doing it directly on the notebook, you can always choose new>terminal and you will have a connection to that machine directly from your jupyter environment where you can install anything you want.
There is a way to SSH into a SageMaker notebook instance without using a third-party reverse proxy like ngrok, setting up an EC2 bastion, or using AWS Systems Manager. Here is how you can do it.
Prerequisites
Use your own VPC and not the VPC managed by AWS/Sagemaker for the notebook instance
Configure an ingress rule in the security group of your notebook instance to allow SSH traffic (port 22 over TCP)
How to do it
Create a lifecycle script configuration that is executed when the instance starts
Add the following snippet inside the lifecycle script :
INSTANCE_IP=$(/sbin/ifconfig eth2 | grep 'inet addr:' | cut -d: -f2 | awk '{ print $1}')
echo "SSH into the instance using : ssh ec2-user@$INSTANCE_IP" > ~ec2-user/SageMaker/ssh-instructions.txt
Add your public SSH key inside /home/ec2-user/.ssh/authorized_keys, either manually with the terminal of jupyterlab UI, or inside the lifecycle script above
When your users open the Jupyter interface, they will find the ssh-instructions.txt file, which gives the host and command to use: ssh ec2-user@<INSTANCE_IP>
If you want to SSH from a local environment, you'll probably need to connect to your VPN that routes your traffic inside your VPC.
GCP's AI Platform Notebooks automatically creates a persistent URL which you can use to access your notebook. Is that what you were looking for?
Try using CreatePresignedNotebookInstanceUrl to access your notebook instance through a URL.
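With boto3, that API call is exposed on the SageMaker client as create_presigned_notebook_instance_url. A minimal sketch, assuming AWS credentials are already configured; the wrapper function name and the notebook name used in the usage note are hypothetical:

```python
def presigned_notebook_url(sagemaker_client, notebook_name, ttl_seconds=1800):
    """Ask SageMaker for a time-limited URL that opens the notebook instance.

    sagemaker_client is expected to be boto3.client("sagemaker").
    """
    response = sagemaker_client.create_presigned_notebook_instance_url(
        NotebookInstanceName=notebook_name,
        SessionExpirationDurationInSeconds=ttl_seconds,
    )
    return response["AuthorizedUrl"]
```

Usage would look like presigned_notebook_url(boto3.client("sagemaker"), "my-notebook"). The returned URL opens the Jupyter UI in a browser. It is not an SSH endpoint, so whether PyCharm can use it as a remote Jupyter server is something you would have to test.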

How to start jupyter notebook on AWS

I am a beginner with Amazon EC2 and recently managed to SSH into an EC2 instance. Yet when I tried to start Jupyter:
jupyter notebook --no-browser --port=8888
I get the message:
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://localhost:8888/?token=????????????????????
I pasted the URL into the browser as instructed (Chrome and Safari), but it did not work. How can I reach the Jupyter notebook over SSH? Thanks!
I hope you didn't just copy the link as it is (localhost). Jupyter is running on EC2, not on your computer, so change the server name to the IP address of your EC2 instance (assuming you have allowed traffic on the correct ports).
There are a few guides on how to access Jupyter notebooks on remote servers, e.g. see
https://jupyter-notebook.readthedocs.io/en/stable/public_server.html#notebook-public-server
If you are just playing along and don't care about security in this case, you may just update the binding IP in your jupyter_notebook_config.py :
c.NotebookApp.ip = '*'
You can start the Jupyter server using the following command (0.0.0.0 binds to all interfaces and avoids the shell expanding an unquoted *):
jupyter notebook --ip=0.0.0.0
If you want to keep it running even after the terminal is closed, use:
nohup jupyter notebook --ip=0.0.0.0 > nohup_jupyter.out &
Remember to open the port 8888 in the AWS EC2 security group inbound to Anywhere (0.0.0.0/0, ::/0)
Then you can access Jupyter at http://<your-ec2-public-ip>:8888
Hope this helps.
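An alternative that avoids opening port 8888 to the world is to forward the port over the SSH connection that already works, so the localhost URL printed by Jupyter can be pasted unchanged. A minimal sketch that only builds the command; the function name, key path and host are hypothetical, and the user ec2-user is an assumption that depends on the AMI:

```python
def local_forward_cmd(key_path, host, local_port=8888, remote_port=8888, user="ec2-user"):
    """Build an ssh command that maps localhost:local_port to the remote Jupyter port."""
    return [
        "ssh", "-i", key_path,
        "-N",  # run no remote command, only forward
        "-L", f"{local_port}:localhost:{remote_port}",  # local port forwarding
        f"{user}@{host}",
    ]


# Hypothetical key path and host, for illustration only.
print(" ".join(local_forward_cmd("~/mykey.pem", "ec2-host.example")))
```

Run the printed command on your own machine while Jupyter is running on the instance, then open http://localhost:8888/?token=... in the local browser.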

How to connect EMR Cluster to EC2 server

I use Spark to run tasks in parallel. To do this, my project connects to a server that produces some data I need in order to start my Spark job.
Now I would like to migrate my project to the cloud on AWS.
I have my Spark app on EMR and my server on EC2. How can I let my Spark app on EMR send HTTP requests to my EC2 server? Do I need something like a gateway?
Thanks,
Have a nice day.
Your EMR cluster actually runs on EC2 servers. You can always SSH to those servers, and from an EMR node you can then SSH to another EC2 server.
In my experience, you should use ssh hadoop@ec2-###-##-##-###.compute-1.amazonaws.com -i /path/mykeypair.pem instead of ssh -i /path/mykeypair.pem -ND 8157 hadoop@ec2-###-##-##-###-.compute.amazonaws.com. The second command gives no response.

Using Jupyter notebook on Spark on EMR

I am new to Spark and AWS. I am trying to install Jupyter on my Spark cluster (EMR), but in the end I am not able to open Jupyter Notebook in my browser.
Context: I have firewall issues at the place where I work, so I can't access the IP address of the EMR cluster I create on a day-to-day basis. I have a dedicated EC2 instance (its IP address is whitelisted) that I use as a client to connect to the EMR cluster I create as needed.
I have access to the IP address of the EC2 instance and the ports 22 and 8080.
I do not have access to the IP address of the EMR cluster.
These are the steps I am following:
1. Open PuTTY and connect to the EC2 instance.
2. Establish a connection between my EC2 instance and the EMR cluster:
ssh -i publickey.pem ec2-user@<host name of the EMR cluster>
3. Install Jupyter on the Spark cluster using the following command:
pip install jupyter
4. Connect to Spark:
PYSPARK_DRIVER_PYTHON=/usr/local/bin/jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777" pyspark --packages com.databricks:spark-csv_2.10:1.1.0 --master spark://127.0.0.1:7077 --executor-memory 6400M --driver-memory 6400M
5. Establish a tunnel to the browser:
ssh -L 0.0.0.0:8080:127.0.0.1:7777 ip-172-31-34-209 -i publickey.pem
6. Open Jupyter in the browser:
http://<host name of EMR cluster>:8080
I am able to run the first 5 steps, but not able to open the Jupyter notebook on my browser.
Didn't test it, as it involves setting up a test EMR server, but here's what should work:
Step 5:
ssh -i publickey.pem -L 8080:127.0.0.1:7777 HOSTNAME
Step 6:
Open the Jupyter notebook in your browser at http://127.0.0.1:8080
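Before opening the browser, it can help to check that something is actually listening on the forwarded port. The sketch below is a generic TCP check and not part of the answer above; the function name is hypothetical. Note that a True result only shows that the local end of the tunnel accepts connections, not that Jupyter is answering on port 7777 at the far end.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


print("tunnel listening on 127.0.0.1:8080:", port_is_open("127.0.0.1", 8080))
```

If this prints False while the ssh command from step 5 is running, the forward was not set up on the machine where the check runs.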
You can use an EMR notebook with Amazon EMR clusters running Apache Spark to remotely run queries and code. An EMR notebook is a "serverless" Jupyter notebook. It sits outside the cluster and takes care of cluster attachment without you having to worry about it.
More information here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks.html