I use Spark to parallelise compute tasks. To do that, my project connects to a server that produces some data I need before starting my Spark job.
Now I would like to migrate my project to the cloud, on AWS.
I have my Spark app on EMR and my server on EC2. How can I make my EMR Spark app able to send HTTP requests to my EC2 server? Do I need something like a gateway?
Thanks,
Have a nice day.
Your EMR cluster actually runs on EC2 instances. You can always SSH into those instances, and from there you can certainly SSH (or send HTTP requests) to another EC2 instance, provided the security groups allow the traffic.
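A quick way to sanity-check this is to call the EC2 server from the EMR master node before wiring it into the Spark job. A minimal sketch, assuming the EC2 server listens on port 8080; the hostname, port and path are placeholders:

# run on the EMR master node after SSHing in
curl -v http://<ec2-private-dns-or-ip>:8080/data
# if this times out, add an inbound rule on the EC2 server's security group
# allowing traffic from the EMR master and core/task security groups on that port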
In my experience, you should use ssh -i /path/mykeypair.pem hadoop@ec2-###-##-##-###.compute-1.amazonaws.com instead of ssh -i /path/mykeypair.pem -ND 8157 hadoop@ec2-###-##-##-###.compute-1.amazonaws.com. The second command gives no response, because -N only sets up the tunnel and does not open a remote shell.
Related
I have a Node API deployed to Google Cloud Run and it is responsible for managing external servers (clean, new Amazon EC2 Linux VMs), including over SSH and SFTP. SSH and SFTP do eventually work, but the connections take 2-5 MINUTES to initiate. Sometimes they time out with handshake timeout errors.
The same service running on my laptop, connecting to the same external servers, has no issues and the connections are as fast as any normal SSH connection.
The deployment on Cloud Run is pretty standard. I'm running it with a service account that permits access to secrets, etc., and plenty of memory allocated.
I have a VPC Connector set up, and have routed all traffic through the VPC connector, as per the instructions here: https://cloud.google.com/run/docs/configuring/static-outbound-ip
I also tried setting UseDNS no in the /etc/ssh/sshd_config file on the EC2 instance, as per some suggestions online about slow SSH logins, but that has not made a difference.
I have rebuilt and redeployed the project a few dozen times and all tests are on brand new EC2 instances.
I am attempting these connections using open-source wrappers around the Node ssh2 library: node-ssh and ssh2-sftp-client.
Ideas?
Cloud Run only gives your container CPU while an HTTP request is active.
You probably don't have an active request while these SSH connections are being set up, and outside of an active request the CPU is throttled.
The best fit for this pipeline is Cloud Workflows together with regular Compute Engine instances.
You can set up a Workflow that starts a Compute Engine instance for this task and stops it once it has finished the steps.
I am the author of the article Run shell commands and orchestrate Compute Engine VMs with Cloud Workflows; it will guide you through the setup.
Executing the Workflow can be triggered by Cloud Scheduler or by HTTP ping.
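A rough sketch of the moving parts with gcloud, assuming the workflow definition lives in workflow.yaml and that the workflow name and region are placeholders:

# deploy the workflow definition
gcloud workflows deploy start-vm-job --source=workflow.yaml --location=us-central1
# execute it on demand; Cloud Scheduler can trigger the same thing on a cron schedule
# by POSTing to the Workflow Executions API with an OAuth service account
gcloud workflows run start-vm-job --location=us-central1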
GCP has finally released managed Jupyter notebooks. I would like to be able to interact with a notebook locally by connecting to it, i.e. use PyCharm to connect to the externally configured Jupyter notebook server by passing its URL & token parameter.
Question also applies to AWS Sagemaker notebooks.
AWS does not natively support SSH-ing into SageMaker notebook instances, but nothing really prevents you from setting up SSH yourself.
The only problem is that these instances do not get a public IP address, which means you have to either create a reverse proxy (with ngrok for example) or connect to it via bastion box.
Steps to make the ngrok solution work:
download ngrok with curl https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip > ngrok.zip
unzip ngrok.zip
create ngrok free account to get permissions for tcp tunnels
run ./ngrok authtoken <your_token> with the token from your account
start with ./ngrok tcp 22 > ngrok.log & (& will put it in the background)
logfile will contain the url so you know where to connect to
create ~/.ssh/authorized_keys file (on SageMaker) and paste your public key (likely ~/.ssh/id_rsa.pub from your computer)
ssh by calling ssh -p <port_from_ngrok_logfile> ec2-user@0.tcp.ngrok.com (or whatever host they assign to you, it's going to be in the ngrok.log)
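Putting the steps above together, a minimal sketch you could paste into a SageMaker terminal (assuming the classic ngrok 2.x binary and that your public key is already in ~/.ssh/authorized_keys; the auth token is a placeholder):

# download and unpack ngrok
curl https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip > ngrok.zip
unzip -o ngrok.zip
# register the token from your free ngrok account
./ngrok authtoken <YOUR_NGROK_TOKEN>
# open a TCP tunnel to the local sshd and keep it running in the background
./ngrok tcp 22 > ngrok.log &
# ngrok.log will contain the assigned host and port; from your laptop:
# ssh -p <port_from_ngrok_logfile> ec2-user@0.tcp.ngrok.com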
If you want to automate it, I suggest using lifecycle configuration scripts.
Another good trick is wrapping the downloading, unzipping, authenticating and starting of ngrok into a small executable in /usr/bin, so you can just call it from the SageMaker console if it dies.
It's a little bit too long to explain completely how to automate it with lifecycle scripts, but I've written a detailed guide on https://biasandvariance.com/sagemaker-ssh-setup/.
On AWS, you can use AWS Glue to create a developer endpoint, and then create the SageMaker notebook from there. A developer endpoint gives you SSH access to your Python or Scala Spark REPL, and it also allows you to tunnel the connection and access it from any other tool, including PyCharm.
For PyCharm Professional we have even tighter integration, allowing you to SFTP files and debug remotely.
And if you need to install any dependencies on the notebook, apart from doing it directly in the notebook, you can always choose New > Terminal and you will get a shell on that machine directly from your Jupyter environment, where you can install anything you want.
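A hedged sketch of what that SSH access typically looks like; the key path, endpoint DNS, ports and the gluepyspark REPL name come from the connection instructions the Glue console shows for your endpoint, so treat them as placeholders:

# open the PySpark REPL on the developer endpoint
ssh -i mykeypair.pem glue@<dev-endpoint-public-dns> -t gluepyspark
# or forward a local port through the endpoint so an external tool (e.g. PyCharm) can connect
ssh -i mykeypair.pem -N -L <local-port>:<remote-host>:<remote-port> glue@<dev-endpoint-public-dns>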
There is a way to SSH into a SageMaker notebook instance without having to use a third-party reverse proxy like ngrok, set up an EC2 bastion, or use AWS Systems Manager. Here is how you can do it.
Prerequisites
Use your own VPC and not the VPC managed by AWS/Sagemaker for the notebook instance
Configure an ingress rule in the security group of your notebook instance to allow SSH traffic (port 22 over TCP)
How to do it
Create a lifecycle script configuration that is executed when the instance starts
Add the following snippet inside the lifecycle script:
INSTANCE_IP=$(/sbin/ifconfig eth2 | grep 'inet addr:' | cut -d: -f2 | awk '{ print $1}')
echo "SSH into the instance using : ssh ec2-user#$INSTANCE_IP" > ~ec2-user/SageMaker/ssh-instructions.txt
Add your public SSH key to /home/ec2-user/.ssh/authorized_keys, either manually via the terminal in the JupyterLab UI, or inside the lifecycle script above
When your users open the Jupyter interface, they will find the ssh-instructions.txt file, which gives the host and command to use: ssh ec2-user@<INSTANCE_IP>
If you want to SSH from a local environment, you'll probably need to connect to your VPN that routes your traffic inside your VPC.
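For reference, a minimal on-start lifecycle script combining the snippet and the key step above (the public key line is a placeholder to replace with your own):

#!/bin/bash
set -e
# eth2 is the VPC-attached interface on a notebook instance running in your own VPC
INSTANCE_IP=$(/sbin/ifconfig eth2 | grep 'inet addr:' | cut -d: -f2 | awk '{ print $1}')
echo "SSH into the instance using : ssh ec2-user@$INSTANCE_IP" > ~ec2-user/SageMaker/ssh-instructions.txt
# authorize your own key; keep permissions strict or sshd will reject the file
mkdir -p /home/ec2-user/.ssh
echo "ssh-rsa AAAA...your-public-key... you@laptop" >> /home/ec2-user/.ssh/authorized_keys
chown -R ec2-user:ec2-user /home/ec2-user/.ssh
chmod 700 /home/ec2-user/.ssh
chmod 600 /home/ec2-user/.ssh/authorized_keys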
GCP's AI Platform Notebooks automatically creates a persistent URL which you can use to access your notebook. Is that what you were looking for?
Try using CreatePresignedNotebookInstanceUrl to access your notebook instance using a URL.
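For example, from the AWS CLI (the notebook instance name is a placeholder):

aws sagemaker create-presigned-notebook-instance-url \
    --notebook-instance-name my-notebook \
    --session-expiration-duration-in-seconds 1800
# returns an AuthorizedUrl that opens the Jupyter server in a browser while it is valid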
I want to run a Python script (which I have run on a Docker Ubuntu installation) on AWS. It sends data from Twitter to Elasticsearch. I want to run it on Amazon Elasticsearch Service. I have set up Amazon Elasticsearch Service on AWS, but I don't know how to get the script into the system and get it running.
What would the SSH command be to access the Elasticsearch server?
Once I am able to access it, where would I place a Python script in order to feed data into the Elasticsearch server?
I tried
PS C:\Users\hpBill> ssh root@search-wbcelastic-*******.us-east-1.es.amazonaws.com/
but just get this:
ssh.exe": search-wbcelastic-**********.us-east-1.es.amazonaws.com/: no address associated with name
I have this information:
Domain status: Active
Endpoint: search-wbcelastic-*********.us-east-1.es.amazonaws.com
Domain ARN: arn:aws:es:us-e******1:domain/wbcelastic
Kibana: search-wbcelastic-********.us-east-1.es.amazonaws.com/_plugin/kibana/
You cannot SSH directly into the Amazon Elasticsearch Service domain, so that SSH command will never work. You have two options for running the Python script: either launch an EC2 instance and run it there with the AWS CLI, or store and run the script from your local machine with the AWS CLI. Here is the developer guide for the AWS command line tools:
http://docs.aws.amazon.com/cloudsearch/latest/developerguide/using-cloudsearch-command-line-tools.html
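Since there is nothing to SSH into, the script (wherever it runs) only needs to reach the domain's HTTPS endpoint. A minimal sketch with curl, assuming the domain's access policy allows your IP; the index name and document are made up, and with IAM-based access you would have to sign the request instead:

# index a test document into a hypothetical "tweets" index (adjust the path to your Elasticsearch version)
curl -XPOST "https://<your-domain-endpoint>.us-east-1.es.amazonaws.com/tweets/_doc" \
    -H 'Content-Type: application/json' \
    -d '{"user": "wbc", "text": "hello from the ingestion script"}'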
I am new to Spark and AWS. I am trying to install Jupyter on my Spark cluster (EMR), but in the end I am not able to open the Jupyter Notebook in my browser.
Context: I have firewall issues from the place where I am working; I can't get access to the IP address of the EMR cluster I create on a day-to-day basis. I have a dedicated EC2 instance (its IP address is whitelisted) that I am using as a client to connect to the EMR cluster I create on a need basis.
I have access to the IP address of the EC2 instance and to ports 22 and 8080.
I do not have access to the IP address of the EMR cluster.
These are the steps I am following:
Open PuTTY and connect to the EC2 instance
Establish a connection between my EC2 instance and the EMR cluster:
ssh -i publickey.pem ec2-user@<host name of the EMR cluster>
Install Jupyter on the Spark cluster using the following command:
pip install jupyter
Connect to spark:
PYSPARK_DRIVER_PYTHON=/usr/local/bin/jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7777" pyspark --packages com.databricks:spark-csv_2.10:1.1.0 --master spark://127.0.0.1:7077 --executor-memory 6400M --driver-memory 6400M
Establish a tunnel to browser:
ssh -L 0.0.0.0:8080:127.0.0.1:7777 ip-172-31-34-209 -i publickey.pem
Open Jupyter in the browser:
http://<host name of the EMR cluster>:8080
I am able to run the first 5 steps, but not able to open the Jupyter notebook on my browser.
Didn't test it, as it involves setting up a test EMR server, but here's what should work:
Step 5:
ssh -i publickey.pem -L 8080:127.0.0.1:7777 HOSTNAME
Step 6:
Open the Jupyter notebook in the browser using 127.0.0.1:8080
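If the browser runs on your laptop rather than on the EC2 instance, you will probably need to chain two tunnels, since only the EC2 instance's IP is whitelisted. A hedged sketch, with hostnames and key paths as placeholders:

# on the EC2 instance: forward its local port 8080 to Jupyter (7777) on the EMR master
ssh -i publickey.pem -L 8080:127.0.0.1:7777 ec2-user@<EMR master hostname>
# on your laptop: forward a local port to that listener on the EC2 instance
ssh -i laptopkey.pem -L 8080:127.0.0.1:8080 ec2-user@<EC2 instance IP>
# then browse to http://127.0.0.1:8080 on the laptop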
You can use an EMR notebook with Amazon EMR clusters running Apache Spark to remotely run queries and code. An EMR notebook is a "serverless" Jupyter notebook. An EMR notebook sits outside the cluster and takes care of the cluster attachment without you having to worry about it.
More information here: https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-managed-notebooks.html
I am running Spark on EMR (on Amazon EC2) and I am trying to tunnel Jupyter Notebook and Zeppelin so I can access them locally.
I tried running the below command with no success:
ssh -i ~/user.pem -ND 8157 hadoop@ec2-XX-XX-XXX-XX.compute-1.amazonaws.com
What exactly is tunnelling and how can I set it up so I can use Jupyter Notebook and Zeppelin on EMR?
Is there a way to set up a basic configuration to make this work?
Many thanks.
Application ports like 8890, for Zeppelin on the master node, are not exposed outside of the cluster. So, if you are trying to access the notebook from your laptop, it will not work. SSH tunneling is a way to access these ports via SSH, securely. You are missing at least one step outlined in Set Up an SSH Tunnel to the Master Node Using Dynamic Port Forwarding. Specifically, "After the tunnel is active, configure a SOCKS proxy for your browser."
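A minimal sketch of that flow (the master node's DNS name is a placeholder):

# 1. open a dynamic (SOCKS) tunnel to the master node; -N keeps it open without a shell
ssh -i ~/user.pem -N -D 8157 hadoop@<master-public-dns>
# 2. configure a SOCKS proxy in your browser (e.g. via FoxyProxy or the system proxy settings)
#    pointing at localhost:8157
# 3. open the web UIs through the proxy, e.g. Zeppelin at http://<master-public-dns>:8890
#    and, if you set it up, Jupyter on whatever port you started it on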