I have a Docker container with JupyterHub installed, running on an AWS EMR cluster, as described here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub.html. It provides Python 3, PySpark 3, PySpark, SparkR, and Spark kernels, and inside the container conda and many other Python packages are installed, but Spark itself is not. The problem is that when I run the pyspark or pyspark3 kernel, it connects to the Spark installed on the master node (outside the Docker container), and the modules installed inside the container are no longer available to that notebook (they are visible to the Python 3 kernel, but then Spark is not visible).
So the question is: how do I make the modules installed inside the Docker container available and visible to a pyspark/pyspark3 notebook? I suspect there is some setting I'm missing.
Essentially, I'm looking for a way to use the modules installed inside the Docker container WITH the externally installed Spark in one notebook.
So far I can only get one or the other.
I just found half of the answer here https://blog.chezo.uno/livy-jupyter-notebook-sparkmagic-powerful-easy-notebook-for-data-scientist-a8b72345ea2d and here https://learn.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-notebook-kernels. The trick is to use the %%local magic in a cell, which gives access to the Python modules installed locally (inside the Docker container). What I still don't know is how to persist a pandas DataFrame created in the "pyspark part" of the notebook so that it is available in the "local part".
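Since these kernels are sparkmagic/Livy based, the -o flag of the %%sql magic described in the HDInsight article above looks like the missing piece: it pulls a query result back into the local Python context as a pandas DataFrame. A sketch (the view and variable names are made up, and it assumes a Spark DataFrame df already exists in the session):
In one cell, which runs on the cluster, expose the Spark DataFrame as a temp view:
df.createOrReplaceTempView("my_table")
In the next cell, pull the result back into the local context as a pandas DataFrame (-n caps the number of rows transferred):
%%sql -o my_pandas_df -n 1000
SELECT * FROM my_table
Finally, work on it inside the container, where the locally installed packages are importable:
%%local
my_pandas_df.describe()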
Related
I'm developing scripts for AWS Glue and trying to mimic the development environment as closely as possible to their specs. Since it's a bit costly to run a notebook server/development endpoint, I set everything up on my local machine instead and develop the scripts in VS Code notebooks, which I find convenient.
There are some problems with the notebook setup due to incompatible versions of the installed Python and Spark.
For Python, I went through a rough cleanup, and its version is now 3.8.3.
For Spark, I use a manual installation of version 2.4.3, since I plan to use Scala alongside it later. I installed the findspark package to load that version.
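Roughly, the findspark setup looks like this (the install path is a placeholder for wherever Spark 2.4.3 is unpacked):
import findspark
findspark.init("/opt/spark-2.4.3-bin-hadoop2.7")  # placeholder path to the Spark 2.4.3 install

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("glue-local").getOrCreate()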
And it doesn't work! The error is TypeError: an integer is required (got type bytes).
I've searched around, and people suggested downgrading to Python 3.7 using pyenv; I installed 3.7.7 but still hit the same error.
As a last resort, I tried pip install pyspark. That gives Spark 3.0.0, which works fine, but it isn't the version I'm targeting.
I hope someone has experience with this issue.
The TypeError: an integer is required (got type bytes) is the usual symptom of running Spark 2.4.x on Python 3.8, which it does not support. A better approach would be to install the Glue dependencies in Docker and then ssh into that Docker container using VS Code, to mimic the Glue local dev environment exactly.
I've written a blog post about this if you'd like to refer to it:
https://towardsdatascience.com/develop-glue-jobs-locally-using-docker-containers-bffc9d95bd1
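Once you are inside the container, a quick sanity check along these lines (a sketch, assuming an image such as aws-glue-libs where PySpark and the awsglue module are already on the path) confirms that both the Glue and Spark libraries resolve:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)       # GlueContext wraps the regular SparkContext
spark = glue_context.spark_session   # the underlying SparkSession
print(spark.version)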
Is there a way to do it right from a cell in the notebook, similar to pip install ... --upgrade?
I couldn't figure out how to follow the instructions at https://docs.qubole.com/en/latest/faqs/general-questions/install-custom-python-libraries.html#pre-installed-python-libraries.
The current Python version is 3.5.3, with pandas 0.20.1. I need to upgrade pandas and Matplotlib.
In Qubole there are two ways to upgrade/install a package for the Python environment. Currently there is no interface inside the notebook to install new packages.
New and recommended way (via Package Management): you can enable the Package Management functionality for an account and add new packages to a cluster via the UI. It has a lot of advantages over the bootstrap approach in terms of performance and usability. Refer to https://docs.qubole.com/en/latest/user-guide/package-management/index.html for further details.
Old way (via node bootstrap): you can configure a bootstrap, which is basically a shell script executed on each node when the cluster starts and/or upscales (i.e., when more nodes are added to the cluster). This is configured via the Clusters UI and needs a cluster restart for every change. This is what the link you shared describes.
You cannot download/upgrade packages directly from a cell in the notebook, because your notebook is associated with a cluster. To ensure that all nodes of the cluster have the package installed, you must use either Package Management (https://docs.qubole.com/en/latest/user-guide/package-management/package-management-environment.html) or the cluster's node bootstrap (https://docs.qubole.com/en/latest/user-guide/clusters/run-scripts-cluster.html#examples-node-scripts).
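Once the cluster comes up with the upgraded packages (via either mechanism), you can verify the versions from a notebook cell; only the installation itself cannot be done there:
import pandas
import matplotlib
print(pandas.__version__, matplotlib.__version__)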
Do let me know if you have any further questions.
I decided to try Google Cloud Datalab for a small project that I'm working on, rather than a Jupyter Notebook in an Anaconda environment on an AWS instance.
How can I install a package (for example OpenCV) onto the Datalab VM so that I don't have to reinstall it every time I restart my VM? Why do the packages disappear after every restart but the updated notebooks remain persistent? Any help answering these questions and clarifying how the Datalab VM works would be very helpful.
The notebooks are stored in a docker volume mount that represents a location on the persistent disk that is maintained across restarts of the VM.
The packages you install, however, are stored in the running container and are hence lost on each restart.
You could create a custom Docker image and use that instead; see the --image-name argument of the datalab create command.
Here is an example of a Dockerfile you'll want to use:
FROM gcr.io/cloud-datalab/datalab:latest
# OpenCV is published on PyPI as opencv-python
RUN pip install opencv-python
Note that you'll need to build the Docker image from this Dockerfile and push it to Google Container Registry. My memory is a bit fuzzy on this, but it is possible the image needs to be marked as public.
Hope that helps!
I'm able to run Spark on AWS EMR without much trouble by following the documentation, but from what I can see it always uses YARN rather than the standalone cluster manager. Is there an easy way to use standalone mode instead of YARN? I don't really feel like hacking the bootstrap scripts to turn off YARN and deploy the Spark master/workers myself.
I'm running into a weird YARN-related bug, and I was hoping it wouldn't occur with the standalone manager.
As far as I know there is no way to run in standalone mode on EMR unless you go back to the old ami-versions instead of using the emr-release-label. The old ami-versions will, however, cause other problems with newer versions of Spark, so I wouldn't go that way.
What you can do is launch ordinary EC2 instances with Spark instead of using EMR. If you have a local Spark installation, go to its ec2 folder and use spark-ec2 to launch the cluster, like this:
./spark-ec2 --copy-aws-credentials --key-pair=MY_KEY --identity-file=MY_PEM_FILE.pem --region=MY_PREFERED_REGION --instance-type=INSTANCE_TYPE --slaves=NUMBER_OF_SLAVES --hadoop-major-version=2 --ganglia launch NAME_OF_JOB
I suspect you have JAR files that are needed; they have to be copied onto the cluster (copy them to the master first, then ssh to the master and copy them onto the slaves from there; ./spark-ec2/copy-dir on the master copies a directory onto all the slaves). Then restart Spark:
./spark/sbin/stop-master.sh
./spark/sbin/stop-slaves.sh
./spark/sbin/start-master.sh
./spark/sbin/start-slaves.sh
and you are ready to launch Spark in standalone mode:
./spark/bin/spark-submit --deploy-mode client ...
I'm a complete beginner with Spark. I'm trying to run Spark on Amazon EC2, but my system does not recognize "spark-ec2" or "./spark-ec2"; it says "spark-ec2" is not recognized as an internal or external command.
I followed the instructions here to launch a cluster. I would like to use Scala; how do I make it work?
Add a PYTHONPATH environment variable that includes boto:
PYTHONPATH="${SPARK_EC2_DIR}/third_party/boto-2.4.1.zip/boto-2.4.1:$PYTHONPATH"
And then execute the Python script (spark_ec2.py).
In order to run the Spark-EC2 script on Windows you need Cygwin and Python. If you don't want to install these programs, you can use the dockerized version of the script (https://github.com/edrevo/spark-ec2-docker), which only depends on Docker.