Is it possible to install Conda packages in AWS Glue? - amazon-web-services

I’m using AWS Glue Jobs with the PythonShell mode for processing some NetCDF4 files. I need to perform some analytics on the data using the scitools-iris package.
On my local Ubuntu machine, I’ve successfully installed it using conda-forge. I tried searching the documentation on how to install the iris package using conda on AWS Glue. But I can’t find anything.
Do you recommend using a different compute option for this problem?

Related

Using AWS EFS to install ghostscript library for camelot

recently I am using AWS EFS for python libraries and packages, instead of lambda layers(because of limitations of lambda layers as you all know)
I have integrated the EFS with lambda function well and everything is in order. I am using camelot for parsing the tables. I have installed every libraries that I need(like camelot, fitz and ...) I can use the installed libraries on EFS well. The problem that I have is with GhostScript. As you know when you want to use camelot with flavor='lattice', we need the ghostscript package. unfortunately when I use the custom aws layer for ghostscript(in my case: arn:aws:lambda:eu-west-3:764866452798:layer:ghostscript:9)
it gives back the error:
'''OSError: Ghostscript is not installed. You can install it using the instructions here: https://camelot-py.readthedocs.io/en/master/user/install-deps.html'''
my runtime on lambda is : python 3.8
Is there anyway that I can use ghostscript layer on lambda(beside this arn that I shared) or anyway to install the ghostscript package on efs to use on lambda?
I appreciate your kindness and your time to answer in advance

How do I upgrade a library in Qubole's Jupyter Notebook, using PySpark?

Is there a way to do it right from a cell in the notebook? similar to pip install ... --upgrade
I didn't know how to do what's instructed on https://docs.qubole.com/en/latest/faqs/general-questions/install-custom-python-libraries.html#pre-installed-python-libraries
The current Python version is 3.5.3, and Pandas 0.20.1. I need to upgrade Pandas, and Matplotlib
In Qubole are two ways to upgrade/install a package for the python environment. Currently there is no interface available inside notebook to install new packages.
New and Recommended Way (via Package Mangement) : User can enable Package Management functionality for an account and add new packages to a cluster via UI. There are lot of advantages of using package management over cluster versions in terms of performance and usability. Refer to https://docs.qubole.com/en/latest/user-guide/package-management/index.html for further details.
Old Way (via bootstrap) : User can configure a bootstrap which is basically a shell script executed on each node when the cluster starts and or upscales (more nodes are getting added to cluster). This can be configured via clusters UI and need a cluster start for every change. This is what is instructed in link you shared.
You cannot download/upgrade packages directly from the cell in the notebook. This is because your notebook is associated to a cluster. Now, to ensure that all the nodes of the cluster have the package installed, you must either use the package management (https://docs.qubole.com/en/latest/user-guide/package-management/package-management-environment.html) or the cluster's node bootstrap (https://docs.qubole.com/en/latest/user-guide/clusters/run-scripts-cluster.html#examples-node-scripts).
Do let me know if you have any further questions.

Error while installing (bootstrapping) latest Spark on latest AWS EMR (5.13.X)

I have been trying to install Spark on latest EMR((5.13.X)cluster via bootstrapping using the following with Terraform, but not getting successful. Any ready to use latest Spark/emr version bootable script or other solution to do using Terraform?
bootstrap_action = {
path = "s3://support.elasticmapreduce/spark/install-spark"
name = "install-spark"
args = ["instance.isMaster=true", "echo running on master node"]}
That install-spark bootstrap action hasn't worked since before Spark was officially supported as an application on AMI version 3.9.0 about three years ago. Also, bootstrap actions built for AMI version 3.x and earlier do not work at all with release labels emr-4.x and emr-5.x+.
Instead, to install Spark on emr-4.x or emr-5.x, you simply include "Spark" in the list of Applications of the RunJobFlowRequest.
I have not used Terraform to create an EMR cluster, but the example I found at https://www.terraform.io/docs/providers/aws/r/emr_cluster.html shows exactly how to create a cluster with Spark.

How to setup and use Kafka-Connect-HDFS in HDP 2.4

I want to use kafka-connect-hdfs on hortonworks 2.4. Can you please help me with the steps i need to follow to setup in HDP env.
Other than building Kafka Connect HDFS from source, you can download and extract Confluent Platform's TAR.GZ files on your Hadoop nodes. That doesn't mean you are "installing Confluent"
Then you can cd /path/to/confluent-x.y.z/
And run Kafka Connect from there.
./bin/connect-standalone ./etc/kafka/connect-standalone.properties ./etc/kafka-connect-hdfs/quickstart-hdfs.properties
If that is working for you, then in order to run connect-distributed (the recommended way to run Kafka Connect), you need to download the same thing on the rest of the machines you want to run Kafka Connect on.

Zeppelin: how to download or save the Zeppelin notebook?

I am using Zeppelin sandbox with aws EMR.
Is there a way to download or save the zeppelin notebook in a way so that it can be imported into another Zeppelin server ?
As noted in the comments above, this feature is available starting in version 0.5.6. You can find more details in the release notes. Downloading and installing this version would solve that issue.
Given that you are using EMR, it looks like you will have to work with the version available. As Samuel mentioned above, you can backup the contents of the incubator-zeppelin/notebook folder and make the transfer.