How to download Zeppelin Notebook from AWS EMR - amazon-web-services

I am running a pre-installed Zeppelin Sandbox on AWS EMR 4.3 with Spark.
I've created a Notebook on Zeppelin (on the EMR cluster) and I now want to export that notebook so that I can quickly run it the next time I spin up an EMR cluster.
It turns out that Zeppelin doesn't support exporting a notebook yet (?).
This is fine because apparently, if you can access the folder Zeppelin is 'installed' in, you can save the folder containing the notebook and then presumably place that folder in a Zeppelin installation on another computer to access the notebook.
(All this is from http://fedulov.website/2015/10/16/export-apache-zeppelin-notebooks/)
The trouble is that I can't find where the 'installation folder' for Zeppelin is on EMR.
P.S. 'Installation folder' may be slightly incorrect; according to the post above I should be looking in /opt/zeppelin, which doesn't exist on the master node of my EMR cluster.

Edit: Zeppelin now supports exporting the notebook in JSON format from the web interface itself! There is a small icon at the top center of the page which allows you to export the notebook.
Zeppelin Notebooks can be found under /var/lib/zeppelin/notebook in an AWS EMR cluster with Zeppelin Sandbox. The notebooks are contained within folders in this directory.
These folders have random names and do not correspond to the name of the Notebook.
ls /var/lib/zeppelin/notebook/
2A94M5J1Y 2A94M5J1Z 2AZU1YEZE 2B3D826UD
There's a note.json file within each folder (which represents a Notebook) that contains the name of the Notebook and all other details.
To export a Notebook, find the folder that corresponds to the notebook you are looking for and copy that folder onto the new Zeppelin installation where you want the notebook to be available.
The above instructions are from: http://fedulov.website/2015/10/16/export-apache-zeppelin-notebooks/
The only difference is that in an AWS setup the Zeppelin notebooks are found in /var/lib/zeppelin/notebook.
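Since the folder names are random, a quick way to locate the folder for a particular notebook is to search the note.json files for the notebook's name. A minimal sketch (the notebook name 'My Notebook' is just a placeholder):
# Print the path of the note.json whose notebook name matches, which tells you which folder to copy
grep -l 'My Notebook' /var/lib/zeppelin/notebook/*/note.json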

Another solution is to create a step in your EMR cluster to back up all your Notebooks, since copying them one by one is a bit tedious.
s3://{s3_bucket}/notebook/notebook_backup.sh
#!/bin/bash
# - Upload Notebooks backups.
aws s3 cp /var/lib/zeppelin/notebook/ s3://{s3_bucket}/notebook/`date +"%Y/%m/%d"` --recursive
# - Update latest folder with latest Notebooks versions.
aws s3 rm s3://{s3_bucket}/notebook/latest --recursive
aws s3 cp /var/lib/zeppelin/notebook/ s3://{s3_bucket}/notebook/latest --recursive
Then add a Step to your EMR cluster to run your own script.
s3://elasticmapreduce/libs/script-runner/script-runner.jar allows you to run scripts from S3.
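For example, the backup script above could be run as a step from the CLI along these lines (the cluster ID is a placeholder and the step definition is a sketch, not taken from the original answer):
# Run the backup script stored in S3 as an EMR step via script-runner.jar
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps Type=CUSTOM_JAR,Name=BackupZeppelinNotebooks,ActionOnFailure=CONTINUE,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=["s3://{s3_bucket}/notebook/notebook_backup.sh"]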

Zeppelin release 0.5.6 and later, which is included in Amazon EMR release 4.4.0 and later, supports using a configuration JSON file to set the notebook storage.
https://aws.amazon.com/blogs/big-data/import-zeppelin-notes-from-github-or-json-in-zeppelin-0-5-6-on-amazon-emr/
You need to create a directory in an S3 bucket called /user/notebook
(user is the name as per the config below)
So if your S3 bucket is
s3://my-zeppelin-bucket-name
you need:
s3://my-zeppelin-bucket-name/user/notebook
and in the config below you don't include the s3:// prefix.
You save this as a .json file and store it in an S3 bucket. When you go to launch your cluster, there's a Configuration section where you point it to this file. When the cluster launches, the pieces of the configuration are injected into the configs of the various Hadoop tools on EMR; in this case zeppelin-env is edited at launch, before Zeppelin is installed.
Once you've run a cluster once, you can clone it and it will remember this config, or use CloudFormation or something like Ansible to script this so your clusters always start up with notebook storage on S3.
[
  {
    "Classification": "zeppelin-env",
    "Properties": {
    },
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "ZEPPELIN_NOTEBOOK_STORAGE": "org.apache.zeppelin.notebook.repo.S3NotebookRepo",
          "ZEPPELIN_NOTEBOOK_S3_BUCKET": "my-zeppelin-bucket-name",
          "ZEPPELIN_NOTEBOOK_USER": "user"
        },
        "Configurations": [
        ]
      }
    ]
  }
]
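For instance, if the JSON above is saved as zeppelin-s3-config.json in S3, the cluster could be launched from the CLI with something like the following (the file name, key pair, application list, and instance settings are placeholders to adapt to your setup):
# Launch an EMR cluster that picks up the zeppelin-env configuration from S3
aws emr create-cluster --name zeppelin-s3-notebooks --release-label emr-4.4.0 --applications Name=Spark Name=Zeppelin-Sandbox --configurations https://s3.amazonaws.com/my-zeppelin-bucket-name/zeppelin-s3-config.json --ec2-attributes KeyName=my-key-pair --instance-type m3.xlarge --instance-count 3 --use-default-roles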

Related

Deploy to AWS EC2 from AWS S3 via Bitbucket Pipelines

I have a requirement to do CI/CD using Bitbucket Pipelines.
We use Maven to build our code on Bitbucket pipelines and push the artifacts (jars) to AWS S3. The missing link is to figure out a way to get the artifacts from S3 and deploy to our EC2 instance.
It should all work from Bitbucket Pipelines yml - hopefully using Maven plugins.
For pushing the artifacts to S3 we use:
<groupId>com.gkatzioura.maven.cloud</groupId>
<artifactId>s3-storage-wagon</artifactId>
Is there a way/plugin that will download the artifact from the S3 bucket, deploy it to a specific folder on the EC2 instance, and perhaps call a shell script to run the jars?
Thank you!
Use AWS CodeDeploy (https://docs.aws.amazon.com/codedeploy/latest/userguide/welcome.html) to deploy it to the EC2 instance. The trigger for CodeDeploy would be the S3 bucket that you deploy your jars to. You will need to turn on S3 versioning to make it work. CodeDeploy has its own set of hooks that you can use to perform any shell command or run any bat files on the EC2 instance.
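As a rough sketch, once the application and deployment group exist and the bundle in S3 is a zip containing the jar plus an appspec.yml, a deployment can be kicked off from the CLI (all names below are placeholders):
# Trigger a CodeDeploy deployment from a revision uploaded to S3
aws deploy create-deployment --application-name my-app --deployment-group-name my-ec2-deployment-group --s3-location bucket=my-artifact-bucket,key=releases/my-app.zip,bundleType=zip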

AWS Cloudformation Windows 2016 EC2 S3 silent install

I have an architecture created using CloudFormation, utilizing a Windows 2016 EC2 server and S3, written in JSON. I have 7 executables uploaded onto my S3 bucket. I can manually silently install everything from a Powershell for AWS prompt once I remote into the EC2 instance. I can do it one at a time, and even have it in a .ps1 file and run it in Powershell for AWS, and it runs correctly.
I am now trying to get this to install silently when the EC2 instance is created. I just can't do it and I can't understand why. The JSON code looks correct. As you can see, I first download everything from the S3 bucket, switch to the c:\TEMP directory where they were all downloaded, then run the executables in unattended install mode. I don't get any errors in my CloudFormation template. It runs "successfully." The problem is that nothing happens. Is it a permissions thing? Any help is welcome and appreciated. Thanks!
Under the AWS::EC2::Instance section I have the UserData section looking something like this (I shortened the executable names below):
"UserData" : { "Fn::Base64" : { "Fn::Join" : ["", [
"<powershell>\n",
"copy-S3Object -BucketName mySilentInstallBucket -KeyPrefix * -LocalFolder c:\\TEMP\\",
"\n",
"cd c:\\TEMP\\",
"\n",
"firefox.exe -S ",
"\n",
"notepadpp.exe /S",
"\n",
"Git.exe /SILENT",
"\n",
"</powershell>"
]]}}
This troubleshooting doc will cover the various reasons you may not be able to connect to S3: https://aws.amazon.com/premiumsupport/knowledge-center/ec2-instance-access-s3-bucket/
To connect to your S3 buckets from your EC2 instances, you need to do the following:
Create an AWS Identity and Access Management (IAM) profile role that grants access to Amazon S3.
Attach the IAM instance profile to the instance.
Validate permissions on your S3 bucket.
Validate network connectivity from the EC2 instance to Amazon S3.
Validate access to S3 buckets.
The CloudFormation template won't fail based on UserData execution exceptions.
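As a rough CLI sketch of steps 2 and 5 (the instance ID, instance profile name, and bucket are placeholders):
# Attach an existing instance profile that grants S3 access (step 2)
aws ec2 associate-iam-instance-profile --instance-id i-0123456789abcdef0 --iam-instance-profile Name=myS3AccessProfile
# From inside the instance, confirm the role can actually list the bucket (step 5)
aws s3 ls s3://mySilentInstallBucket
On Windows Server 2016, the EC2Launch UserData output is typically logged under C:\ProgramData\Amazon\EC2-Windows\Launch\Log, which is usually the quickest way to see whether the copy-S3Object call failed.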

Slurm cluster in Google cloud: Data in mounted directory in controller/login node not available in compute nodes

I have created a Slurm cluster following this tutorial. I have also created a data bucket that stores some data that needs to be accessed in the compute nodes. Since the compute nodes share the home directory of the login node, I mounted the bucket in my login node using gcsfuse. However, if I execute a simple script test.py that prints the contents of the mounted directory, it is just empty. The folder is there, as is the Python file.
Is there something that I have to specify in the yaml configuration file that enables having access to the mounted directory?
I have written down the steps that I have taken in order to mount the directory:
When creating the Slurm cluster using
gcloud deployment-manager deployments create google1 --config slurm-cluster.yaml
it is important that the node that should mount the storage directory has sufficient permissions.
Uncomment/add the following in the slurm-cluster.yaml file if your login node should mount the data. (Do the same with the controller node instead if you prefer.)
login_node_scopes:
  - https://www.googleapis.com/auth/devstorage.read_write
Next, log into the login node and install gcsfuse. Once gcsfuse is installed, you can mount the bucket using the following command:
gcsfuse --implicit-dirs <BUCKET-NAME> target/folder/
Note that the service account attached to your VM has to have access rights on the bucket. You can find the name of the service account in the details of your VM in the cloud console or by running the following command on the VM:
gcloud auth list
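If the service account is missing those rights, one way to grant them (assuming the Storage Object Admin role fits your use case; the account and bucket names are placeholders) is:
# Grant the VM's service account object read/write access on the bucket
gsutil iam ch serviceAccount:my-sa@my-project.iam.gserviceaccount.com:roles/storage.objectAdmin gs://BUCKET-NAME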
I've just got a similar setup working. I don't have a definite answer to why yours isn't, but a few notes:
gcsfuse is installed by default; there is no need to install it explicitly.
You need to wait for the Slurm install to be fully finished before the bucket is available.
The devstorage.read_write scope appears to be needed.
I have the following under the login_machine_type in the yaml file:
network_storage:
  - server_ip: none
    remote_mount: mybucket
    local_mount: /data
    fs_type: gcsfuse
    mount_options: file_mode=664,dir_mode=775,allow_other

How to pull Docker image from a private repository using AWS Batch?

I'm using AWS Batch and my Docker image is hosted on a private Nexus repo. I'm trying to create the Job Definition but I can't find anywhere how to specify the repo credentials like we do with a Task Definition in ECS.
I tried to manually specify it in the JSON like this:
{
  "command": ["aws", "s3", "ls"],
  "image": "nexus-docker-repo.xxxxx.xxx/my-image",
  "memory": 1024,
  "vcpus": 1,
  "repositoryCredentials": {
    "credentialsParameter": "ARN_OF_CREDENTIALS"
  },
  "jobRoleArn": "ARN_OF_THE_JOB"
}
But when I apply the changes, the credentialsParameter parameter is removed. I think it's not supported.
So how do you pull an image from a private repo with AWS Batch? Is it possible?
Thank you.
I do not see the option repositoryCredentials either in the batch job definition.
A secure option could be:
Generate the config.json for docker login.
Place that file in S3.
Generate an IAM role that has access to that file.
Create a compute environment with a Launch Template and user data to download the config.json (see the sketch below).
Run the jobs with that compute environment.
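A minimal sketch of that user data, assuming the instance's IAM role can read the object (the bucket, key, and destination path are assumptions and depend on how your container instances are set up to pick up Docker credentials; also note that AWS Batch requires launch-template user data to be wrapped in MIME multi-part format):
#!/bin/bash
# Hypothetical bucket/key: fetch the config.json produced by `docker login` elsewhere
aws s3 cp s3://my-config-bucket/docker/config.json /root/.docker/config.json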
OK, I was able to do it by modifying the file /etc/ecs/ecs.config.
If the file is not there, you have to create it.
Then I had to add these two lines to that file:
ECS_ENGINE_AUTH_TYPE=docker
ECS_ENGINE_AUTH_DATA={"https://index.docker.io/v1/":{"username":"admin","password":"admin","email":"admin@example.com"}}
Then I had to restart the ECS agent:
sudo systemctl restart ecs ## for the Amazon ECS-optimized Amazon Linux 2 AMI
Or
sudo stop ecs && sudo start ecs ## for the Amazon ECS-optimized Amazon Linux AMI

spark-submit from outside AWS EMR cluster

I have an AWS EMR cluster running spark, and I'd like to submit a PySpark job to it from my laptop (--master yarn) to run in cluster mode.
I know that I need to set up some config on the laptop, but I'd like to know what the bare minimum is. Do I just need some of the config files from the master node of the cluster? If so, which? Or do I need to install hadoop or yarn on my local machine?
I've done a fair bit of searching for an answer, but I haven't yet been able to tell whether what I was reading referred to launching a job from the master node of the cluster or from some arbitrary laptop...
If you want to run the spark-submit job solely on your AWS EMR cluster, you do not need to install anything locally. You only need the EC2 key pair you specified in the Security Options when you created the cluster.
I personally scp over any relevant scripts &/or jars, ssh into the master node of the cluster, and then run spark-submit.
You can specify most of the relevant spark job configurations via spark-submit itself. AWS documents in some more detail how to configure spark-submit jobs.
For example:
>> scp -i ~/PATH/TO/${SSH_KEY} /PATH/TO/PYSPARK_SCRIPT.py hadoop@${PUBLIC_MASTER_DNS}:
>> ssh -i ~/PATH/TO/${SSH_KEY} hadoop@${PUBLIC_MASTER_DNS}
>> spark-submit --conf spark.OPTION.OPTION=VALUE PYSPARK_SCRIPT.py
However, if you already pass a particular configuration when creating the cluster itself, you do not need to re-specify those same configuration options via spark-submit.
You can set up the AWS CLI on your local machine, put your deployment on S3, and then add an EMR step to run on the EMR cluster. Something like this:
aws emr add-steps --cluster-id j-xxxxx --steps Type=spark,Name=SparkWordCountApp,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=false,--num-executors,5,--executor-cores,5,--executor-memory,20g,s3://codelocation/wordcount.py,s3://inputbucket/input.txt,s3://outputbucket/],ActionOnFailure=CONTINUE
Source: https://aws.amazon.com/de/blogs/big-data/submitting-user-applications-with-spark-submit/