How can I connect to a Google Dataproc cluster from sparklyr?

I'm new to Spark and GCP. I've tried to connect to it with
sc <- spark_connect(master = "IP address")
but it obviously didn't work (for one thing, there is no authentication).
How should I do that? Is it possible to connect to it from outside Google Cloud?

There are two issues with connecting to Spark on Dataproc from outside the cluster: configuration and network access. It is generally somewhat difficult and not fully supported, so I would recommend using sparklyr inside the cluster.
Configuration
Google Cloud Dataproc runs Spark on Hadoop YARN, so you need to connect through YARN:
sc <- spark_connect(master = 'yarn-client')
However, you also need a yarn-site.xml in your $SPARK_HOME directory to point Spark to the right hostname.
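If you are connecting from a machine outside the cluster, one way to get that file is to copy the cluster's Hadoop configuration down to your Spark client. A minimal sketch, assuming a hypothetical cluster named my-cluster (its master node is my-cluster-m) and a local Spark installation at $SPARK_HOME:
# Hypothetical: pull the cluster's YARN config into the local Spark conf directory
gcloud compute scp my-cluster-m:/etc/hadoop/conf/yarn-site.xml "$SPARK_HOME/conf/yarn-site.xml"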
Network Access
While you can open ports to your IP address using firewall rules on your Google Compute Engine network, it's not considered a good security practice. You would also need to configure YARN to use the instance's external IP address or have a way to resolve hostnames on your machine.
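If you do decide to open ports anyway, scope the rule to your own IP. A sketch, where the rule name, network, port, and source range are all placeholders:
# Hypothetical firewall rule: allow the YARN ResourceManager port only from one IP
gcloud compute firewall-rules create allow-yarn-from-my-ip \
  --network default \
  --allow tcp:8032 \
  --source-ranges 203.0.113.5/32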
Using sparklyr on Dataproc
sparklyr can be installed and run in the R REPL by SSHing into the master node and running:
$ # Needed for the curl library
$ sudo apt-get install -y libcurl4-openssl-dev
$ R
> install.packages('sparklyr')
> library(sparklyr)
> sc <- spark_connect(master = 'yarn-client')
I believe RStudio Server supports SOCKS proxies, which can be set up over an SSH tunnel (a sketch follows below), but I am not very familiar with RStudio.
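The usual pattern for reaching web UIs on the cluster is a dynamic SSH tunnel through the master node, with the browser pointed at the resulting SOCKS proxy. A sketch, again with a hypothetical cluster named my-cluster:
# Open a SOCKS proxy through the Dataproc master node (cluster name is a placeholder)
gcloud compute ssh my-cluster-m -- -D 1080 -N
# In another terminal, start a browser that uses the proxy, e.g. Chrome with a throwaway profile
google-chrome --proxy-server="socks5://localhost:1080" --user-data-dir=/tmp/my-cluster-proxy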
I use Apache Zeppelin on Dataproc for R notebooks, but it autoloads SparkR, which I don't think plays well with sparklyr at this time.

Related

Cannot access to localhost

I deployed an application on Google Cloud (GKE). In order to access its UI, I did port-forwarding (port 9090). When I use the Cloud Shell web preview I can access the UI. However, when I try to open localhost:9090 in my own browser, I cannot access it. Do you know why I cannot access it from my browser? Is that normal?
Thank you!
Answer provided in the comments by a community member.
Do you know why I cannot access from my browser, is it normal?
Cloud Shell is where you're running kubectl port-forward. Port forwarding only applies to the host on which the command is run unless you have a chain of port-forwarding commands. If you want to access the UI from your local host, then you will need to run the kubectl port-forward on your local host too.
So how can I run the kubectl port-forward command on my local host for the application that I deployed to the cloud? Should I install the Google Cloud CLI on my local machine?
I assumed (!) that you're running kubectl port-forward on Cloud Shell. If that's correct, then you need to install kubectl on your local machine to run it there. Because of the way that GKE authenticates, it may also be prudent to install gcloud on your local machine. You can then use gcloud container clusters get-credentials ... to create a local Kubernetes (GKE) config file on your local machine that is then used by kubectl commands.
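Roughly, the local workflow would look like this, where the cluster name, zone, and service name are placeholders for your own resources:
# Fetch credentials for the GKE cluster so the local kubectl can talk to it
gcloud container clusters get-credentials my-cluster --zone us-central1-a
# Forward local port 9090 to the application's service port
kubectl port-forward service/my-app 9090:9090
# The UI is now reachable locally at http://localhost:9090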

How to remotely connect to GCP ML Engine/AWS Sagemaker managed notebooks?

GCP has finally released managed Jupyter notebooks. I would like to be able to interact with the notebook locally by connecting to it, i.e. use PyCharm to connect to the externally configured Jupyter notebook server by passing its URL and token parameter.
Question also applies to AWS Sagemaker notebooks.
AWS does not natively support SSH-ing into SageMaker notebook instances, but nothing really prevents you from setting up SSH yourself.
The only problem is that these instances do not get a public IP address, which means you have to either create a reverse proxy (with ngrok, for example) or connect to it via a bastion box.
Steps to make the ngrok solution work:
download ngrok with curl https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip > ngrok.zip
unzip ngrok.zip
create ngrok free account to get permissions for tcp tunnels
authenticate with your token by running ./ngrok authtoken <your_token>
start with ./ngrok tcp 22 > ngrok.log & (& will put it in the background)
the logfile will contain the URL, so you know where to connect
create ~/.ssh/authorized_keys file (on SageMaker) and paste your public key (likely ~/.ssh/id_rsa.pub from your computer)
ssh by calling ssh -p <port_from_ngrok_logfile> ec2-user@0.tcp.ngrok.com (or whatever host they assign to you; it's going to be in ngrok.log)
If you want to automate it, I suggest using lifecycle configuration scripts.
Another good trick is wrapping downloading, unzipping, authenticating and starting ngrok into some binary in /usr/bin so you can just call it from SageMaker console if it dies.
It's a little bit too long to explain completely how to automate it with lifecycle scripts, but I've written a detailed guide on https://biasandvariance.com/sagemaker-ssh-setup/.
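As a rough illustration (not the full guide), an on-start lifecycle script along these lines would automate the steps above; the ngrok token and the public key are placeholders, and the download URL may change:
#!/bin/bash
# Hypothetical SageMaker on-start lifecycle script: SSH access via an ngrok TCP tunnel
set -e
# Download and unpack ngrok (same URL as in the steps above)
curl -s https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip > /tmp/ngrok.zip
unzip -o /tmp/ngrok.zip -d /usr/local/bin
# Authenticate with your ngrok account token (placeholder)
/usr/local/bin/ngrok authtoken YOUR_NGROK_TOKEN
# Authorize your workstation's public key (placeholder)
mkdir -p /home/ec2-user/.ssh
echo "ssh-rsa AAAA... you@workstation" >> /home/ec2-user/.ssh/authorized_keys
chown -R ec2-user:ec2-user /home/ec2-user/.ssh
chmod 600 /home/ec2-user/.ssh/authorized_keys
# Start the tunnel in the background; the log will contain the assigned host and port
nohup /usr/local/bin/ngrok tcp 22 > /var/log/ngrok.log 2>&1 &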
On AWS, you can use AWS Glue to create a development endpoint, and then you create the SageMaker notebook from there. A development endpoint gives you SSH access to a Python or Scala Spark REPL, and it also allows you to tunnel the connection and access it from any other tool, including PyCharm.
For PyCharm professional we have even tighter integration, allowing you to SFTP files and debug remotely.
And if you need to install any dependencies on the notebook, apart from doing it directly in the notebook, you can always choose New > Terminal and you will have a connection to that machine directly from your Jupyter environment, where you can install anything you want.
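I don't have the exact commands at hand, but per the Glue developer endpoint documentation, opening the REPL over SSH is roughly of this shape; the key path and endpoint address are placeholders:
# Open the PySpark REPL on a Glue development endpoint over SSH (all values are placeholders)
ssh -i ~/.ssh/glue-dev-key.pem glue@<dev-endpoint-public-address> -t gluepyspark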
There is a way to SSH into a SageMaker notebook instance without using a third-party reverse proxy like ngrok, setting up an EC2 bastion, or using AWS Systems Manager. Here is how you can do it.
Prerequisites
Use your own VPC and not the VPC managed by AWS/Sagemaker for the notebook instance
Configure an ingress rule in the security group of your notebook instance to allow SSH traffic (port 22 over TCP)
How to do it
Create a lifecycle script configuration that is executed when the instance starts
Add the following snippet inside the lifecycle script:
INSTANCE_IP=$(/sbin/ifconfig eth2 | grep 'inet addr:' | cut -d: -f2 | awk '{ print $1}')
echo "SSH into the instance using : ssh ec2-user#$INSTANCE_IP" > ~ec2-user/SageMaker/ssh-instructions.txt
Add your public SSH key to /home/ec2-user/.ssh/authorized_keys, either manually via the JupyterLab terminal or inside the lifecycle script above
When your users open the Jupyter interface, they will find the ssh-instructions.txt file, which gives the host and command to use: ssh ec2-user@<INSTANCE_IP>
If you want to SSH from a local environment, you'll probably need to connect to your VPN that routes your traffic inside your VPC.
GCP's AI Platform Notebooks automatically creates a persistent URL which you can use to access your notebook. Is that what you were looking for?
Try using CreatePresignedNotebookInstanceUrl to access your notebook instance via a URL.
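For example, with the AWS CLI (the notebook instance name is a placeholder):
# Returns a time-limited URL that opens the Jupyter UI of the named instance
aws sagemaker create-presigned-notebook-instance-url \
  --notebook-instance-name my-notebook-instance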

Using Jupyter notebook on Spark on EMR

I am new to Spark and AWS. I am trying to install Jupyter on my Spark cluster (EMR), but in the end I am not able to open the Jupyter notebook in my browser.
Context: I have firewall issues from the place I am working, so I can't access the IP address of the EMR cluster I create on a day-to-day basis. I have a dedicated EC2 instance (the IP address for this instance is whitelisted) that I am using as a client to connect to the EMR cluster I create on a need basis.
I have access to the IP address of the EC2 instance and the ports 22 and 8080.
I do not have access to the IP address of EMR cluster.
These are the steps I am following:
Open putty and connect to the EC2 instance
Establish connection between my EC2 instance and EMR cluster
ssh -i publickey.pem ec2-user@<host name of the EMR cluster>
Install Jupyter on the Spark cluster using the following command:
pip install jupyter
Connect to spark:
PYSPARK_DRIVER_PYTHON=/usr/local/bin/jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777" pyspark --packages com.databricks:spark-csv_2.10:1.1.0 --master spark://127.0.0.1:7077 --executor-memory 6400M --driver-memory 6400M
Establish a tunnel to browser:
ssh -L 0.0.0.0:8080:127.0.0.1:7777 ip-172-31-34-209 -i publickey.pem
Open Jupyter in the browser:
http://<host name of the EMR cluster>:8080
I am able to run the first 5 steps, but I am not able to open the Jupyter notebook in my browser.
Didn't test it, as it involves setting up a test EMR server, but here's what should work:
Step 5:
ssh -i publickey.pem -L 8080:127.0.0.1:7777 HOSTNAME
Step 6:
Open the Jupyter notebook in the browser at http://127.0.0.1:8080
You can use an EMR notebook with Amazon EMR clusters running Apache Spark to remotely run queries and code. An EMR notebook is a "serverless" Jupyter notebook. It sits outside the cluster and takes care of cluster attachment without you having to worry about it.
More information here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks.html

Amazon EMR Tunneling Zeppelin and Jupyter Notebook

I am running Spark EMR on Amazon EC2 and I am trying to tunnel Jupyter Notebook and Zeppelin, so I can access them locally.
I tried running the below command with no success:
ssh -i ~/user.pem -ND 8157 hadoop@ec2-XX-XX-XXX-XX.compute-1.amazonaws.com
What exactly is tunnelling and how can I set it up so I can use Jupyter Notebook and Zeppelin on EMR?
Is there a way to I set up a basic configuration to make this work?
Many thanks.
Application ports like 8890, for Zeppelin on the master node, are not exposed outside of the cluster. So, if you are trying to access the notebook from your laptop, it will not work. SSH tunneling is a way to access these ports via SSH, securely. You are missing at least one step outlined in Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding. Specifically, "After the tunnel is active, configure a SOCKS proxy for your browser."
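Put together, the flow looks something like this; the key and master hostname come from your own cluster, and launching Chrome with a proxy flag is just one way to use the tunnel (the AWS guide describes FoxyProxy):
# 1. Open a dynamic (SOCKS) tunnel through the EMR master node
ssh -i ~/user.pem -ND 8157 hadoop@ec2-XX-XX-XXX-XX.compute-1.amazonaws.com
# 2. Point a browser at the proxy, e.g. a throwaway Chrome profile
google-chrome --proxy-server="socks5://localhost:8157" --user-data-dir=/tmp/emr-proxy
# 3. Browse to the master node's application ports, e.g. Zeppelin on 8890:
#    http://<master-node-dns>:8890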

vagrant, puppet, aws but without vagrant on aws

So I have been googling for a while now and either I have completed the internet or I cannot articulate my search query to find the answer so I thought I would come here.
So my team and I want to use Vagrant on our local machines, which is fine. We want to use Puppet for our configs. Now, we don't want Vagrant inside our AWS/DigitalOcean/whatever provider's instances. How do I get the Puppet config to automatically build the instance for us?
I am a little stuck. I think I need a Puppet master, but how does the AWS instance, for example, get built based on the Puppet config, and how does Vagrant use the same config?
Thanks
That's the default behavior if you install Vagrant on your local workstation and configure an instance for AWS. Vagrant will connect to the instance over SSH and install the client software (in this case Puppet) to configure the instance.
In short: Vagrant will not install itself on any AWS instance.
Here's a link to the Vagrant-AWS Plugin:
Vagrant-AWS
Further information:
Vagrant uses providers to create VMs. The normal workflow is to use, for example, the VirtualBox provider (which is built into Vagrant) to create local VMs. You can set attributes for the specific provider in the Vagrantfile. In this case you need the Vagrant AWS provider (which is a plugin, installed with the vagrant plugin install <pluginname> command), so that you can create VMs remotely. Just as with the VirtualBox provider, Vagrant will not install itself on the created VM (remote or not doesn't matter).
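A minimal sketch of that workflow from the command line, assuming the vagrant-aws plugin and its usual dummy box (check the plugin's README for the Vagrantfile provider settings):
# Install the AWS provider plugin and a placeholder box for it
vagrant plugin install vagrant-aws
vagrant box add dummy https://github.com/mitchellh/vagrant-aws/raw/master/dummy.box
# Bring the machine up in AWS instead of VirtualBox; Vagrant SSHes in and runs the
# Puppet provisioner on the instance, it never installs itself there
vagrant up --provider=aws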
Vagrant uses masterless provisioning (Puppet Apply): the script runs inside your Vagrant box.
To provision machines in the cloud you need a Puppet master server and Puppet clients.
To bootstrap clients automatically, you can add a shell script to your server's user data (DigitalOcean, AWS EC2).
This script is responsible for installing Puppet and connecting to the master server.
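A rough sketch of such a user-data script, assuming a Debian/Ubuntu image and a Puppet master reachable at puppet.example.com (both are placeholders):
#!/bin/bash
# Hypothetical cloud-init user data: install the Puppet agent and point it at the master
apt-get update
apt-get install -y puppet
# Tell the agent where the master lives (placeholder hostname)
puppet config set server puppet.example.com --section agent
# Run the agent once; it requests a certificate that you sign on the master,
# after which the node's configuration is applied
puppet agent --test --waitforcert 60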