Is PygreSQL available on AWS Glue Spark Jobs?

I tried using the PygreSQL modules:
import pg
import pgdb
but the modules are reported as not found when running on AWS Glue Spark.
The AWS Glue Developer Guide, https://docs.aws.amazon.com/glue/latest/dg/glue-dg.pdf, says they are available for Python shell jobs, though.
Can anyone else confirm this?
Is there a page I can refer to that lists which libraries come by default in the Python environment?
Is there an alternative PostgreSQL library for running on Glue Spark jobs? I know it is possible to use an external library by uploading it to S3 and adding the path in the job configuration, but I would like to avoid as many manual steps as possible.

The document that you have shared talks about libraries intended only for Python shell jobs. If you want this library in a Glue Spark job, then you need to package it, upload it to S3, and import it in your Glue job.
There are alternatives like pg8000 that can also be used as an external Python library. This and this talk more about how you can package it; the same approach can also be used with the PygreSQL library.
This also has more information on how you can connect to on-premises PostgreSQL databases.
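As a hedged example of the pg8000 route: a minimal sketch of querying PostgreSQL from a Glue Spark job, assuming pg8000 has been made available to the job (e.g. packaged and referenced via --extra-py-files, or installed with --additional-python-modules on newer Glue versions); the host, database, and credentials are placeholders.

import pg8000

# Sketch only: pg8000 is a pure-Python PostgreSQL driver, so it can be
# packaged and shipped to a Glue Spark job without native dependencies.
# Connection details below are placeholders.
conn = pg8000.connect(
    user="glue_user",
    password="secret",
    host="mydb.example.com",
    port=5432,
    database="mydb",
)
cur = conn.cursor()
cur.execute("SELECT version()")
print(cur.fetchone())
conn.close()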

Related

Installing New Drivers for AWS Glue+PyODBC

I need to use AWS Glue to access two instances of MS SQL Server, one hosted on RDS and one that is external. I am attempting to do this via pyodbc, which comes preinstalled/configured by Amazon. The connection fails repeatedly, so I ran the following:
import pyodbc
import sys

print("Drivers:")
# List the ODBC drivers visible in this Glue environment
dlist = pyodbc.drivers()
for drvr in dlist:
    print(drvr)
This shows me that the only installed ODBC drivers are for MySQL and PostgreSQL.
Question: is there any way to get the appropriate SQL Server driver installed such that my Glue script can see/use it? I can see in Amazon's documentation how to use the job parameters to bring over new modules similar to "pip install xyz", but that doesn't seem to bring the appropriate driver over with it.
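For reference, a hedged sketch of the pip-style job parameter mechanism mentioned above, using boto3; the job name and module spec are placeholders, and this assumes a Glue version that supports the --additional-python-modules argument. As observed, it installs Python packages at job start but does not install OS-level ODBC drivers.

import boto3

glue = boto3.client("glue")

# Sketch only: --additional-python-modules pip-installs Python packages when
# the job starts, but it cannot install OS-level ODBC drivers such as the
# SQL Server driver. The job name and module spec below are placeholders.
glue.start_job_run(
    JobName="my-mssql-job",
    Arguments={"--additional-python-modules": "some-module==1.0"},
)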

Is it possible to install Conda packages in AWS Glue?

I’m using AWS Glue Jobs with the PythonShell mode for processing some NetCDF4 files. I need to perform some analytics on the data using the scitools-iris package.
On my local Ubuntu machine, I’ve successfully installed it using conda-forge. I tried searching the documentation on how to install the iris package using conda on AWS Glue. But I can’t find anything.
Do you recommend using a different compute option for this problem?

Using AWS EFS to install ghostscript library for camelot

Recently I have been using AWS EFS for Python libraries and packages instead of Lambda layers (because of the limitations of Lambda layers, as you all know).
I have integrated EFS with the Lambda function and everything is in order. I am using camelot for parsing tables, and I have installed every library that I need (like camelot, fitz and ...). I can use the libraries installed on EFS without problems. The problem that I have is with Ghostscript. As you know, when you want to use camelot with flavor='lattice', you need the ghostscript package. Unfortunately, when I use the custom AWS layer for Ghostscript (in my case: arn:aws:lambda:eu-west-3:764866452798:layer:ghostscript:9),
it gives back the error:
'''OSError: Ghostscript is not installed. You can install it using the instructions here: https://camelot-py.readthedocs.io/en/master/user/install-deps.html'''
My runtime on Lambda is Python 3.8.
Is there any way I can use a Ghostscript layer on Lambda (besides the ARN I shared), or any way to install the Ghostscript package on EFS to use with Lambda?
Thanks in advance for your kindness and your time.
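For context, a minimal sketch of the camelot call that triggers this dependency; the PDF path is a placeholder.

import camelot

# The 'lattice' flavor uses Ghostscript to convert PDF pages to images,
# which is why the missing-Ghostscript OSError is raised at this call.
# "tables.pdf" is a placeholder path.
tables = camelot.read_pdf("tables.pdf", flavor="lattice")
print(tables[0].df)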

Can you load standard zeppelin interpreter settings from S3?

Our company is building up a suite of common internal Spark functions and jobs, and I'd like to make sure that our data scientists have access to all of these when they prototype in Zeppelin.
Ideally, I'd like a way for them to start up a Zeppelin notebook on AWS EMR and have the dependency jar we build automatically loaded onto it, without them having to type in the Maven information manually every time (private repo location/credentials, package info, etc.).
Right now we have the dependency jar loaded on S3, and with some work we could get a private maven repository to host it on.
I see that ZEPPELIN_INTERPRETER_DIR saves off interpreter settings, but I don't think it can load from a common default location (like S3, or something).
Is there a way to tell Zeppelin on an EMR cluster to load its interpreter settings from a common location? I can't be the first person to want this.
Other thoughts I've had but have not tried yet:
Have a script that uses AWS command-line options to start an EMR cluster with all the necessary settings pre-made for you. (Could also upload the .jar dependency if we can't get Maven to work.)
Use an infrastructure-as-code framework to start up the clusters with the required settings.
I don't believe it's possible to tell EMR to load settings from a common location. The first thought you included is the way to go, imo: you would run aws emr create ..., and that create would include a shell-script step that replaces /etc/zeppelin/conf.dist/interpreter.json by downloading the interpreter.json of interest from S3 and then hard-restarts Zeppelin (sudo stop zeppelin; sudo start zeppelin).
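A rough sketch of that approach with boto3, adding the step to an existing cluster; the cluster ID, bucket, and key are placeholders, and the same command could equally be attached as a step at cluster-creation time.

import boto3

emr = boto3.client("emr")

# Sketch only: run a shell command on the master node via command-runner.jar
# that overwrites Zeppelin's interpreter settings with a copy held in S3 and
# then hard-restarts Zeppelin. Cluster ID, bucket, and key are placeholders.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "Load Zeppelin interpreter settings from S3",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "bash",
                    "-c",
                    "sudo aws s3 cp s3://my-bucket/zeppelin/interpreter.json "
                    "/etc/zeppelin/conf.dist/interpreter.json && "
                    "sudo stop zeppelin && sudo start zeppelin",
                ],
            },
        }
    ],
)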

Using spark with latest aws-java-sdk?

We are currently using Spark 2.1 with Hadoop 2.7.3, and I know (and can't believe) that Spark still requires aws-java-sdk version 1.7.4. We are using a Maven project, and I was wondering if there is any way to set up libraries or my environment so that I can use Spark 2.1 along with other applications that use the latest aws-java-sdk. I guess it's the same as asking whether it's possible to set up a workflow that uses different versions of the aws-java-sdk, so that when I want to run the jar on a cluster I could just point to the latest aws-java-sdk. I know I could obviously maintain two separate projects, one for Spark and one for pure SDK work, but I'd like to just have them in the same project.
use spark 2.1 along with other applications that use the latest aws-java-sdk
You can try to use the Maven Shade Plugin when you create your JAR, then ensure the user classpath is placed before the Hadoop classpath (spark.executor.userClassPathFirst). This will ensure you're loading all the dependencies included by Maven, not what's provided by Spark.
I've done this with Avro before, but I know that the AWS SDK has more to it.
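For illustration, a hedged sketch of setting the classpath-ordering properties mentioned above from a PySpark session; the shaded-jar path is a placeholder, and the Maven Shade Plugin configuration itself would live in the project's pom.xml.

from pyspark.sql import SparkSession

# Sketch only: prefer the user-supplied (shaded) jar, which bundles a newer
# aws-java-sdk, over the versions Spark and Hadoop ship with.
# The jar path is a placeholder.
spark = (
    SparkSession.builder
    .appName("newer-aws-sdk-example")
    .config("spark.jars", "s3://my-bucket/jars/my-app-shaded.jar")
    .config("spark.executor.userClassPathFirst", "true")
    .config("spark.driver.userClassPathFirst", "true")
    .getOrCreate()
)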