What is a trusted notebook in DSX? - data-science-experience

What is a trusted notebook in DSX? What is the difference between a trusted and an untrusted notebook?
The online documentation is not clear on this.

You can run all notebook cells in the Apache Spark service to ensure that the notebook is considered trusted by the service.
When you open a notebook, the Apache Spark service verifies the signature of the notebook. If the signature is not valid, the notebook is considered untrusted, and the JavaScript and HTML output will not be displayed. You must rerun all notebook cells to generate a new signature in the notebook metadata; the notebook will then be considered trusted.
More info here: https://datascience.ibm.com/docs/content/analyze-data/trusted-notebooks.html?context=analytics
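This behaviour follows Jupyter's standard notebook trust model. As an illustrative aside (this is stock Jupyter tooling, and it is an assumption that it applies inside the DSX environment rather than a documented DSX feature; the filename is a placeholder), a notebook can also be re-signed from the command line:
# Re-sign the notebook so its stored HTML/JavaScript output is rendered again.
jupyter trust my_notebook.ipynb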

Related

GCP AI notebook instance permission

I have a GCP AI notebook instance. Anyone with admin access to notebooks in my project can open a notebook and read, modify, or delete any folder/file created by me or any other user on my team. Is there a way to create a private home directory like /home/user, as we could have done if we had used JupyterHub installed on a VM?
Implementing your requirement is not feasible with AI Notebooks. AI Notebooks is intended as a rapid prototyping and development environment that can be easily managed, and advanced multi-user scenarios fall outside its intended purpose.
The Python kernel in AI Notebooks always runs under the Linux user "jupyter", regardless of which GCP user accesses the notebook. Anyone who has editor permissions on your Google Cloud project can see everyone else's work through the Jupyter UI.
To isolate each user's work, the recommended option is to set up an individual notebook instance per user. See the 'Single User' option.
It is not feasible to combine multiple instances under a master instance in AI Notebooks, so the recommended approach is to give each user their own notebook instance and share source code via Git or another repository system. See the Save a Notebook to GitHub doc for more information.
You probably created the notebook using Service Account mode. You can restrict access to a single user via single-user mode.
Example:
proxy-mode=mail,proxy-user-mail=user@domain.com
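As a rough sketch, that metadata can be applied to the underlying Compute Engine VM with gcloud (the instance name, zone, and email address below are placeholders, and whether the running proxy picks up the change without a restart is an assumption):
# Switch an existing notebook VM to single-user mode; only the named account
# can then open JupyterLab through the inverting proxy.
gcloud compute instances add-metadata my-notebook-instance \
  --zone us-central1-a \
  --metadata proxy-mode=mail,proxy-user-mail=user@domain.com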

Error in launching Jupyter notebook on Google Cloud Platform

I followed this blog post on creating a new VM and am trying to launch a Jupyter notebook on GCP:
https://medium.com/@kn.maragatham09/installing-jupyter-notebook-on-google-cloud-11979e40cd10
Getting this error message
Did you try the obvious checks?
Does the cert exist? (Maybe it was removed along the way.)
Does the cert have the correct ownership?
Does the user executing Jupyter have the right to access the cert?
Maybe you could try running Jupyter in verbose mode and post the output here.
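For example, something along these lines would cover those checks (the paths are placeholders, not taken from the question):
# Check that the certificate exists and who owns it.
ls -l ~/.jupyter/mycert.pem
# Confirm the Jupyter config points at the same file.
grep certfile ~/.jupyter/jupyter_notebook_config.py
# Start Jupyter with debug-level logging to see why startup fails.
jupyter notebook --debug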
FYI, GCP now offers an easy-to-use, pre-configured Jupyter notebook environment called AI Platform Notebooks. Have you tried using that instead? You won't need to worry about setting up any certs :)

How do I migrate data from a local to a remote MySQL instance

I have a local MySQL database and I want to migrate the data inside of it to a remote MySQL database (using RDS on AWS). How can I migrate my data between the two instances?
AWS DMS helps you migrate large, terabyte-scale databases to the AWS Cloud easily and securely. During migration, the source database remains fully operational, minimizing downtime.
But judging from your question you want a homogeneous data migration, and as per the AWS documentation:
If you're performing a homogeneous migration, use your engine's native tools, such as MySQL dump or MySQL replication.
Refer to this answer for using SQL dump on larger data.
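As a minimal sketch of the native-tools route (the database name, users, and RDS endpoint below are placeholders):
# Dump the local database to a file.
mysqldump -u root -p --single-transaction mylocaldb > mylocaldb.sql
# Create the target database on the RDS instance and load the dump into it.
mysql -h mydb.abc123.us-east-1.rds.amazonaws.com -u admin -p -e "CREATE DATABASE mylocaldb"
mysql -h mydb.abc123.us-east-1.rds.amazonaws.com -u admin -p mylocaldb < mylocaldb.sql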
Thanks
Use the AWS Database Migration Service that is available in AWS. You need to provide your database endpoint, i.e. your on-premises database server endpoint, set your DB engine parameters to your requirements, and launch. It takes 10-15 minutes to migrate your data to the cloud, and from there you can continue accessing your database from AWS itself.
The other method is to take a recent backup of your on-premises database, launch an EC2 instance in AWS, and install the same database engine you use on premises. Copy the backup file from your system to the cloud and bring up the database from that backup. Then set up an RDS instance of the same type you installed on EC2 and connect the endpoints.

Does silent install of Tableau Server 10.2 on AWS EC2 instance work?

I am working on infrastructure automation using Ansible scripts. I installed Tableau Server 10.2 on EC2 instance manually. It requires you to accept terms and asks for registration. Out of the box Tableau doesn't support silent install, at least I didn't find anything on their forums. Has anyone tried automating the installation of Tableau Server?
I would start with the Quickstart CloudFormation scripts that Tableau and AWS have posted here: Quickstart URL.
They have a Windows single-machine install and a cluster, both of which use a Python script to do the silent install. If you're looking for the Linux setup, there is a branch with a CloudFormation template that does a Linux install on a single server.
If your Tableau install requires AD, the documentation out there will say that the Quickstart doesn't support it. However, if you check the config.yml file used in the Quickstart and the documentation here, it appears you can put the AD configuration you need in the JSON file.
In your script, post-install, you'll want to copy your backup configuration from S3 and perform a restore using tabadmin restore backupFile. Most of what the CloudFormation does should be usable in Ansible.
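A rough sketch of that post-install step (the bucket and backup file names are placeholders; tabadmin is the admin CLI that ships with Tableau Server 10.2, and the stop/restore/start sequence here is the usual manual procedure rather than anything specific to the Quickstart):
# Pull the backup from S3 onto the Tableau Server host.
aws s3 cp s3://my-tableau-bucket/mybackup.tsbak .
# Stop the server, restore the backup, then start it again.
tabadmin stop
tabadmin restore mybackup.tsbak
tabadmin start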

How to run Python Spark code on Amazon AWS?

I have written Python code in Spark and I want to run it on Amazon's Elastic MapReduce (EMR).
My code works great on my local machine, but I am slightly confused about how to run it on Amazon's AWS.
More specifically, how should I transfer my Python code over to the master node? Do I need to copy my Python code to my S3 bucket and execute it from there? Or should I SSH into the master and scp my Python code to the Spark folder on the master?
For now, I tried running the code locally on my terminal and connecting to the cluster address (I did this by reading the output of Spark's --help flag, so I might be missing a few steps here):
./bin/spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.1 \
--master spark://hadoop@ec2-public-dns-of-my-cluster.compute-1.amazonaws.com \
mypythoncode.py
I tried it with and without my permissions file, i.e.
-i permissionsfile.pem
However, it fails, and the stack trace shows something along the lines of:
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).
at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:66)
at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
......
......
Is my approach correct, and I just need to resolve the access issues to get going, or am I heading in the wrong direction?
What is the right way of doing it?
I searched a lot on YouTube but couldn't find any tutorials on running Spark on Amazon's EMR.
If it helps, the dataset I am working on is part of Amazon's public datasets.
Go to EMR and create a new cluster... [recommendation: start with 1 node only, just for testing purposes].
Click the checkbox to install Spark; you can uncheck the other boxes if you don't need those additional programs.
Configure the cluster further by choosing a VPC and a security key (SSH key, a.k.a. pem key).
Wait for it to boot up. Once your cluster says "Waiting", you're free to proceed.
[Spark submission via the GUI] In the GUI, you can add a Step, select a Spark job, upload your Spark file to S3, and then choose the path to that newly uploaded S3 file. Once it runs it will either succeed or fail. If it fails, wait a moment and then click "view logs" next to that Step in the list of steps. Keep tweaking your script until you've got it working.
[Submission via the command line] SSH into the driver node following the SSH instructions at the top of the page. Once inside, use a command-line text editor to create a new file, paste in the contents of your script, and then run spark-submit yourNewFile.py (see the sketch below). If it fails, you'll see the error output straight in the console. Tweak your script and re-run until it works as expected.
Note: running jobs from your local machine against a remote cluster is troublesome because you may actually be making your local instance of Spark responsible for some expensive computations and for data transfer over the network. That's why you want to submit AWS EMR jobs from within EMR.
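As a minimal sketch of that command-line route (the key file and cluster DNS name are placeholders, not values from the question):
# Copy the script to the master node and log in.
scp -i mykey.pem mypythoncode.py hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com:~/
ssh -i mykey.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
# On the master node: submit to YARN, which EMR uses as the cluster manager.
spark-submit --master yarn --deploy-mode cluster mypythoncode.py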
There are typically two ways to run a job on an Amazon EMR cluster (whether for Spark or other job types):
Log in to the master node and run Spark jobs interactively. See: Access the Spark Shell
Submit jobs to the EMR cluster. See: Adding a Spark Step
If you have Apache Zeppelin installed on your EMR cluster, you can use a web browser to interact with Spark.
The error you are experiencing says that files were accessed via the s3n: protocol, which requires AWS credentials to be provided. If, instead, the files were accessed via s3:, I suspect that the credentials would be sourced from the IAM role that is automatically assigned to nodes in the cluster, and this error would be resolved.
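A sketch of both options, assuming mypythoncode.py takes its input path as a command-line argument (the bucket name and key placeholders are illustrative, not values from the question):
# Preferred on EMR: reference the data as s3:// so the cluster's IAM role is
# used automatically and no explicit keys are needed.
spark-submit mypythoncode.py s3://my-bucket/input/
# If the script must keep s3n:// paths, supply the properties named in the
# error message through Spark's Hadoop configuration pass-through.
spark-submit \
  --conf spark.hadoop.fs.s3n.awsAccessKeyId=<ACCESS_KEY_ID> \
  --conf spark.hadoop.fs.s3n.awsSecretAccessKey=<SECRET_ACCESS_KEY> \
  mypythoncode.py s3n://my-bucket/input/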