How do I upgrade a library in Qubole's Jupyter Notebook, using PySpark?

Is there a way to do it right from a cell in the notebook, similar to pip install ... --upgrade?
I couldn't work out how to do what's instructed at https://docs.qubole.com/en/latest/faqs/general-questions/install-custom-python-libraries.html#pre-installed-python-libraries
The current Python version is 3.5.3 and Pandas is 0.20.1. I need to upgrade Pandas and Matplotlib.

In Qubole there are two ways to upgrade or install a package for the Python environment. Currently there is no interface available inside the notebook to install new packages.
New and Recommended Way (via Package Management): you can enable the Package Management functionality for an account and add new packages to a cluster via the UI. Package management has a number of advantages over the node bootstrap in terms of performance and usability. Refer to https://docs.qubole.com/en/latest/user-guide/package-management/index.html for further details.
Old Way (via bootstrap): you can configure a node bootstrap, which is basically a shell script executed on each node when the cluster starts or upscales (i.e., when more nodes are added to the cluster). This can be configured via the Clusters UI and needs a cluster restart for every change. This is what is instructed in the link you shared.
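For the bootstrap route, a minimal sketch of what the node bootstrap could contain (the pip binary and the exact Python environment on the cluster nodes are assumptions; point it at whichever environment your notebooks actually use):
#!/bin/bash
# Node bootstrap: runs on every node when the cluster starts or upscales,
# so each node ends up with the upgraded packages.
pip install --upgrade pandas matplotlib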

You cannot download or upgrade packages directly from a cell in the notebook, because your notebook is associated with a cluster. To ensure that all the nodes of the cluster have the package installed, you must use either package management (https://docs.qubole.com/en/latest/user-guide/package-management/package-management-environment.html) or the cluster's node bootstrap (https://docs.qubole.com/en/latest/user-guide/clusters/run-scripts-cluster.html#examples-node-scripts).
Do let me know if you have any further questions.

Related

AWS Parallel Cluster software installation

I am very new to generic HPC concepts, and recently I have needed to use AWS ParallelCluster to conduct some large-scale parallel computation.
I went through this tutorial and successfully built a cluster with the Slurm scheduler, and I can log in to the system with ssh. But I got stuck here: I need to install some software and can't work out how. Should I just do sudo apt-get install xxx and expect it to be installed on every new node that is instantiated whenever a job is scheduled? On one hand that sounds like magic; on the other hand, do the master node and the newly instantiated nodes share the same storage? If so, apt-get install might work, since they would be using the same file system. The Internet seems to have very little material about this.
To conclude, my questions are: if I want to install packages on the cluster I created on AWS, can I use sudo apt-get install xxx to do it? Do the newly instantiated nodes share the same storage as the head node? If so, is this good practice? If not, what is the right way?
Thank you very much!
On a ParallelCluster-deployed cluster, the /home directory of the head node is shared by default as an NFS share across all compute nodes. So if you just install your application in the user folder (the ec2-user home folder), it will be available to all compute nodes. Once you have installed your application, you can run it using the scheduler.
Your next question may be that /home is limited in space; that is why it is recommended to attach an additional shared storage volume to the head node during cluster creation. This lets you control the attributes of the shared storage, such as its size and type. For more details, see the shared storage configuration section of the ParallelCluster documentation:
https://docs.aws.amazon.com/parallelcluster/latest/ug/SharedStorage-v3.html
Using additional shared storage is the recommended way to run your production workloads, as you have better control over the storage volume's attributes. To get started, however, you could simply try running from your home folder first.
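For example, a minimal sketch of installing something from the head node into the shared home directory (package names and paths below are only illustrative):
# /home is NFS-shared with the compute nodes by default, so a user-level
# install on the head node is visible cluster-wide.
pip3 install --user numpy                      # lands in ~/.local
# Or build a tool into a prefix under your home directory.
./configure --prefix=$HOME/software/mytool && make && make install
# Then submit a job that uses it through the scheduler.
export PATH=$HOME/software/mytool/bin:$PATH
sbatch run_job.sh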
Thanks

Can you load standard zeppelin interpreter settings from S3?

Our company is building up a suite of common internal Spark functions and jobs, and I'd like to make sure that our data scientists have access to all of these when they prototype in Zeppelin.
Ideally, I'd like a way for them to start up a Zeppelin notebook on AWS EMR and have the dependency jar we build loaded onto it automatically, without them having to type in the Maven information manually every time (private repo location/credentials, package info, etc.).
Right now we have the dependency jar loaded on S3, and with some work we could get a private maven repository to host it on.
I see that ZEPPELIN_INTERPRETER_DIR saves off interpreter settings, but I don't think it can load them from a common default location (like S3 or similar).
Is there a way to tell Zeppelin on an EMR cluster to load its interpreter settings from a common location? I can't be the first person to want this.
Other thoughts I've had but have not tried yet:
Have a script that uses AWS command-line options to start an EMR cluster with all the necessary settings pre-made for you. (It could also upload the .jar dependency if we can't get Maven to work.)
Use an infrastructure-as-code framework to start up the clusters with the required settings.
I don't believe it's possible to tell EMR to load settings from a common location. The first thought you included is the way to go, in my opinion: you would run aws emr create ..., and that create would include a shell-script step that replaces /etc/zeppelin/conf.dist/interpreter.json by downloading the interpreter.json of interest from S3, and then hard-restarts Zeppelin (sudo stop zeppelin; sudo start zeppelin).
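A sketch of what that shell-script step might look like (the bucket and key are placeholders; the conf.dist path and the stop/start commands come from the description above):
#!/bin/bash
# Run as an EMR step after Zeppelin is installed, e.g. via script-runner.
# Pull the shared interpreter settings from S3 over the default ones...
sudo aws s3 cp s3://my-bucket/zeppelin/interpreter.json /etc/zeppelin/conf.dist/interpreter.json
# ...then hard-restart Zeppelin so it picks them up.
sudo stop zeppelin
sudo start zeppelin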

AWS JupyterHub pyspark notebook to use pandas module

I have a Docker container with JupyterHub installed, running on an AWS EMR cluster as described here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub.html. It has Python 3, PySpark 3, PySpark, SparkR, and Spark kernels, and inside the container conda and many other Python packages are installed, but no Spark. The problem is that when I run the pyspark or pyspark3 kernel, it connects to the Spark installed on the main node (outside the Docker container), and all the internal modules are no longer available to the notebook (they are visible to the Python kernel, but then Spark is not visible).
So the question is: how can I make the modules installed inside the Docker container available and visible to the pyspark/pyspark3 notebook? I think there is something in the settings that I'm missing.
I'm essentially looking for a way to use Docker's internally installed modules WITH the externally installed Spark in one notebook.
So far I can only get one or the other.
I just found half of the answer here https://blog.chezo.uno/livy-jupyter-notebook-sparkmagic-powerful-easy-notebook-for-data-scientist-a8b72345ea2d and here https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-notebook-kernels. The secret is to use the %%local magic in a cell, which lets us access the Python modules installed locally (in the Docker container). Now I just don't know how to persist a pandas DataFrame created in the "pyspark part" of the notebook so that it is available in the "local" part.

Trying Dask on AWS

I am a scientist who is exploring the use of Dask on Amazon Web Services. I have some experience with Dask, but none with AWS. I have a few large custom task graphs to execute, and a few colleagues who may want to do the same if I can show them how. I believe that I should be using Kubernetes with Helm because I fall into the "Try out Dask for the first time on a cloud-based system like Amazon, Google, or Microsoft Azure" category.
I also fall into the "Dynamically create a personal and ephemeral deployment for interactive use" category. Should I be trying native Dask-Kubernetes instead of Helm? It seems simpler, but it's hard to judge the trade-offs.
In either case, how do you provide Dask workers a uniform environment that includes your own Python packages (not on any package index)? The solution I've found suggests that packages need to be on a pip or conda index.
Thanks for any help!
Use Helm or Dask-Kubernetes?
You can use either. Generally starting with Helm is simpler.
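If you do start with Helm, a minimal sketch of standing up the public Dask chart might look like the following (the repository URL, chart name, and release name reflect the upstream Dask Helm chart and should be checked against its documentation for your version):
# Add the public Dask Helm repository and deploy the chart
# (the release name "my-dask" is arbitrary).
helm repo add dask https://helm.dask.org/
helm repo update
helm install my-dask dask/dask
# Inspect the scheduler, worker, and Jupyter pods it creates.
kubectl get pods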
How to include custom packages
You can install custom software using pip or conda; the packages don't need to be on PyPI or the default Anaconda channel, since you can point pip or conda at other indexes and channels. Here is an example installing software with pip from GitHub:
pip install git+https://github.com/username/repository#branch
For small custom files you can also use the Client.upload_file method.
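To illustrate the point about other indexes and channels, a couple of hedged examples (the private index URL, channel, and package names are placeholders; EXTRA_PIP_PACKAGES is a convention used by the Dask Helm chart and Docker images, so verify it exists in the chart version you deploy):
# Install from a private pip index or an alternative conda channel.
pip install --index-url https://pypi.mycompany.internal/simple mypackage
conda install -c conda-forge mypackage
# Many versions of the Dask Helm chart pip-install whatever is listed in the
# EXTRA_PIP_PACKAGES environment variable when a worker starts, which also
# accepts git URLs, so custom packages reach every worker:
helm upgrade my-dask dask/dask \
  --set worker.env[0].name=EXTRA_PIP_PACKAGES \
  --set worker.env[0].value="git+https://github.com/username/repository#branch"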

Can I run gcloud components update?

Will updating gcloud components from within my Google Cloud Shell instance persist?
Will updating anything, like Go or NPM, that is pre-installed with Google Cloud Shell persist?
Yes, depending upon where you install those tools.
When you start a new Cloud Shell session, you get a persistent disk for yourself, and the system image is constructed from a template. Any changes you make to your disk will persist, while anything you do to the core image will not.
All the pre-installed tools are part of the system image, which is maintained by the GCP team and updated for all users. If you update tools or switch versions there, the changes will not persist.
But if you want to install custom tools or switch to a specific version, you can install them under your $HOME. They will live on your disk and hence will persist across terminations and relaunches.
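For example, a minimal sketch of installing a newer copy of a tool under $HOME so that it survives restarts (the tool name and download URL are placeholders):
# Anything under $HOME lives on your persistent Cloud Shell disk.
mkdir -p "$HOME/bin"
curl -fsSL -o "$HOME/bin/mytool" https://example.com/downloads/mytool
chmod +x "$HOME/bin/mytool"
# Put $HOME/bin ahead of the pre-installed copy; ~/.bashrc is also on the persistent disk.
echo 'export PATH="$HOME/bin:$PATH"' >> "$HOME/.bashrc"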