Python 2 with Spark 2.0 - data-science-experience

How do we create a Spark service for Python 2 or 3 with Spark 2.0? Whenever I create a new service and associate it with a Python notebook, it's Python 2 with Spark 1.6. Why can't I see the configuration of the service I am creating, like in the Databricks free edition? I want to use the SparkSession API introduced in Spark 2.0 to create the Spark session variable, hence the question.

You can choose the Python and Spark version when:
a. Creating a new notebook in Data Science Experience:
DSX `Project` --> Overview --> `+ add notebooks` --> `Choose the language` (Python 2/R/Scala/Python 3) and the Spark version (1.6/2.0/2.1).
b. Changing the kernel of an existing notebook:
From any running notebook, open the `Kernel` menu and choose the language and Spark version combination of your choice.

You cannot see the configuration of the service you are creating, because you're not creating a service with its own configuration. The Apache Spark as a Service instances in Bluemix and Data Science Experience get execution slots in a shared cluster. The configuration of that shared cluster is managed by IBM.
The Jupyter Notebook server of your instance has kernel specs for each supported combination of language and Spark version. To switch your notebook to a different combination, select "Kernel -> Change Kernel -> (whatever)". Or select language and Spark version separately when creating a notebook.
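Once the notebook is running a Spark 2.0 (or later) kernel, a minimal sketch of using the SparkSession API looks like the following; the app name is just an illustrative placeholder, and in many hosted notebooks a session may already be pre-created, in which case getOrCreate() simply returns it.

from pyspark.sql import SparkSession

# Build (or reuse) a SparkSession; if the kernel already provides one,
# getOrCreate() returns the existing session instead of creating a new one.
spark = SparkSession.builder \
    .appName("dsx-spark2-example") \
    .getOrCreate()

# Quick check that the kernel really is on Spark 2.x
print(spark.version)

# Small DataFrame to confirm the session works
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()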

Related

How do I upgrade a library in Qubole's Jupyter Notebook, using PySpark?

Is there a way to do it right from a cell in the notebook, similar to `pip install ... --upgrade`?
I couldn't figure out how to do what's instructed at https://docs.qubole.com/en/latest/faqs/general-questions/install-custom-python-libraries.html#pre-installed-python-libraries
The current Python version is 3.5.3 and Pandas is 0.20.1. I need to upgrade Pandas and Matplotlib.
In Qubole there are two ways to upgrade/install a package for the Python environment. Currently there is no interface available inside the notebook to install new packages.
New and recommended way (via Package Management): you can enable the Package Management functionality for an account and add new packages to a cluster via the UI. It has a lot of advantages over the node bootstrap in terms of performance and usability. Refer to https://docs.qubole.com/en/latest/user-guide/package-management/index.html for further details.
Old way (via bootstrap): you can configure a node bootstrap, which is basically a shell script executed on each node when the cluster starts and/or upscales (more nodes are added to the cluster). This can be configured via the Clusters UI and needs a cluster restart for every change. This is what is instructed in the link you shared.
You cannot download/upgrade packages directly from a cell in the notebook. This is because your notebook is associated with a cluster. To ensure that all the nodes of the cluster have the package installed, you must use either Package Management (https://docs.qubole.com/en/latest/user-guide/package-management/package-management-environment.html) or the cluster's node bootstrap (https://docs.qubole.com/en/latest/user-guide/clusters/run-scripts-cluster.html#examples-node-scripts).
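As a quick sanity check after the packages have been upgraded through Package Management or the node bootstrap and the cluster has been restarted, a notebook cell like the following (nothing here is Qubole-specific) shows which versions the cluster actually provides:

import sys
import pandas as pd
import matplotlib

# Confirm the interpreter and library versions available to the notebook
print(sys.version)             # e.g. 3.5.3
print(pd.__version__)          # should reflect the upgraded Pandas
print(matplotlib.__version__)  # should reflect the upgraded Matplotlib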
Do let me know if you have any further questions.

Location of Sqoop installation on Amazon EMR cluster?

I started an EMR cluster in order to test out Sqoop, but it turns out it doesn't seem to be installed on the latest version of EMR (5.19.0), as I didn't find it in the directory /usr/lib/sqoop. I tried 5.18.0 as well, but it was missing there too.
According to the application versions page, Sqoop 1.4.7 should be installed on the cluster.
The EMR console gives me a list of 4 "installations". I chose the Core Hadoop package. It has Hive, Hue, etc. installed in /usr/lib. Am I missing something here? It's my first time using EMR or Sqoop.
I did not see the "Advanced Options" link at the top of the "Create Cluster" page where I can select individual software to install.
When creating an EMR cluster, use the advanced options link, which allows you to select Sqoop as one of the applications to install.

How to set up a local development environment for PySpark ETL to run in AWS Glue?

PyCharm Professional supports connecting to, deploying on, and remotely debugging an AWS Glue development endpoint (https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-pycharm.html), but I can't figure out how to use VS Code (my code editor of choice) for this purpose. Does VS Code support any of these functionalities? Or is there another free alternative to PyCharm Professional with the same capabilities?
I have not used PyCharm, but I have set up a local development endpoint with Zeppelin for my Glue job development and testing. Please see my related posts and references for setting up a local development endpoint. Maybe you can try it, and if it is useful, try to use PyCharm instead of Zeppelin.
References: "Is it possible to use Jupyter Notebook for AWS Glue instead of Zeppelin" and the linked SO discussions on setting up a Zeppelin local development endpoint.

Error while installing (bootstrapping) latest Spark on latest AWS EMR (5.13.X)

I have been trying to install Spark on the latest EMR (5.13.x) cluster via bootstrapping, using the following with Terraform, but have not been successful. Is there any ready-to-use bootstrap script for the latest Spark/EMR versions, or another solution for doing this with Terraform?
bootstrap_action = {
  path = "s3://support.elasticmapreduce/spark/install-spark"
  name = "install-spark"
  args = ["instance.isMaster=true", "echo running on master node"]
}
That install-spark bootstrap action hasn't worked since before Spark was officially supported as an application on AMI version 3.9.0 about three years ago. Also, bootstrap actions built for AMI version 3.x and earlier do not work at all with release labels emr-4.x and emr-5.x+.
Instead, to install Spark on emr-4.x or emr-5.x, you simply include "Spark" in the list of Applications of the RunJobFlowRequest.
I have not used Terraform to create an EMR cluster, but the example I found at https://www.terraform.io/docs/providers/aws/r/emr_cluster.html shows exactly how to create a cluster with Spark.
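If you call the EMR API directly rather than going through Terraform, a minimal boto3 sketch of a RunJobFlow request that installs Spark as an application might look like the following; the region, instance types, counts, roles, and log bucket are placeholders, and the default EMR roles are assumed to already exist in your account.

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # example region

response = emr.run_job_flow(
    Name="spark-cluster",
    ReleaseLabel="emr-5.13.0",
    # Installing Spark is just a matter of listing it as an application;
    # no bootstrap action is needed on emr-4.x/5.x release labels.
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.large",   # placeholder instance type
        "SlaveInstanceType": "m4.large",    # placeholder instance type
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://my-log-bucket/emr-logs/",  # placeholder bucket
)
print(response["JobFlowId"])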

Spark standalone mode on AWS EMR

I'm able to run Spark on AWS EMR without much trouble by following the documentation, but from what I see it always uses YARN instead of the standalone cluster manager. Is there any easy way to use standalone mode instead of YARN? I don't really feel like hacking the bootstrap scripts to turn off YARN and deploy the Spark master/workers myself.
I'm running into a weird YARN-related bug and I was hoping it won't happen with the standalone manager.
As far as I know there is no way to run in standalone mode on EMR unless you go back to the old ami-versions instead of using the emr-release-label. The old ami-versions will, however, cause other problems with newer versions of Spark, so I wouldn't go that way.
What you can do is launch ordinary EC2 instances with Spark instead of using EMR. If you have a local Spark installation, go to the ec2 folder and use spark-ec2 to launch the cluster, like this:
./spark-ec2 --copy-aws-credentials --key-pair=MY_KEY --identity-file=MY_PEM_FILE.pem --region=MY_PREFERED_REGION --instance-type=INSTANCE_TYPE --slaves=NUMBER_OF_SLAVES --hadoop-major-version=2 --ganglia launch NAME_OF_JOB
I suspect that you have jar files that are needed, so they have to be copied onto the cluster (copy them to the master first, ssh to the master, and copy them onto the slaves from there; ./spark-ec2/copy-dir on the master will copy a directory onto all slaves). Then restart Spark:
./spark/sbin/stop-master.sh
./spark/sbin/stop-slaves.sh
./spark/sbin/start-master.sh
./spark/sbin/start-slaves.sh
and you are ready to launch Spark in standalone mode:
./spark/bin/spark-submit --deploy-mode client ...
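When submitting, you also point the job at the standalone master, either with --master spark://MASTER_PUBLIC_DNS:7077 on the spark-submit command line or inside the application itself (7077 is the standalone master's default port; the host name below is a placeholder for your own master's address). A minimal PySpark script to submit this way could be:

from pyspark import SparkConf, SparkContext

# Connect to the standalone master started by start-master.sh above.
# Replace MASTER_PUBLIC_DNS with your master node's address.
conf = (SparkConf()
        .setMaster("spark://MASTER_PUBLIC_DNS:7077")
        .setAppName("standalone-smoke-test"))
sc = SparkContext(conf=conf)

# Trivial job to confirm the executors register with the standalone master
print(sc.parallelize(range(100)).sum())

sc.stop()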