Installing New Drivers for AWS Glue+PyODBC

I need to use AWS Glue to access two instances of MS SQL Server, one hosted on RDS and one that is external. I am attempting to do this via pyodbc, which comes preinstalled and configured by Amazon. The connection fails repeatedly, so I ran the following:
import pyodbc
import sys

print("Drivers:")
dlist = pyodbc.drivers()
for drvr in dlist:
    print(drvr)
This shows me that the only installed ODBC drivers are for MySQL and PostgreSQL.
Question: is there any way to get the appropriate SQL Server ODBC driver installed so that my Glue script can see and use it? Amazon's documentation shows how to use job parameters to bring in new Python modules, similar to "pip install xyz", but that doesn't seem to bring the underlying driver over with it.
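For reference, the connection I am ultimately trying to make looks roughly like the sketch below. The server, database, and credentials are placeholders, and it assumes a native driver such as "ODBC Driver 17 for SQL Server" is visible to pyodbc, which is exactly the piece that seems to be missing:

import pyodbc

# Placeholder connection details; the DRIVER name assumes a SQL Server ODBC
# driver has been installed in the Glue environment, which is the missing piece.
conn_str = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my-sql-server.example.com,1433;"
    "DATABASE=mydb;"
    "UID=myuser;"
    "PWD=mypassword;"
)

conn = pyodbc.connect(conn_str, timeout=10)
print(conn.execute("SELECT @@VERSION").fetchone())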

Related

Upgrade Apache Superset 0.19.0 to the latest (1.3.2?)

We want to upgrade our old Superset installation (0.19.0) to the latest version (1.3.2 or 1.4.0), and are wondering whether there is anything we need to do to migrate the existing dashboards.
All metadata sits in an AWS RDS database. Do we need to make any changes in the DB, or can we simply connect to it from the new Superset version?
Export all the dashboards, connect to the database from the new Superset version, and import the dashboards into the new environment.

Is PygreSQL available on AWS Glue Spark Jobs?

I tried using PygreSQL modules
import pg
import pgdb
but it says the modules were not found when running on AWS Glue Spark.
Their Developer Guide, https://docs.aws.amazon.com/glue/latest/dg/glue-dg.pdf, says it's available for Python Shell though.
Can anyone else confirm this?
Is there a page I can refer to for what libraries that come by default for the Python environment?
Is there an alternative to a PostgreSQL library for running on Spark Glue jobs? I know it is possible to use an external library by uploading it to S3 and adding the path in the configurations, but I would like to avoid as many manual steps as possible.
The document that you have shared only talks about libraries intended for Python shell jobs. If you want this library in a Glue Spark job, then you need to package it, upload it to S3, and import it in your Glue job.
There are alternatives like pg8000 which can also be used as an external Python library. This and this talk more about how you can package it; the same approach can also be used with the PygreSQL library.
Also, this has more information on how you can connect to on-premises PostgreSQL databases.
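As a rough sketch, assuming pg8000 has been packaged and attached to the job as described above (host, database, and credentials below are placeholders), querying PostgreSQL from the job could look like this:

import pg8000

# Placeholder connection details -- replace with your own endpoint and credentials.
conn = pg8000.connect(
    host="my-postgres.example.com",
    port=5432,
    database="mydb",
    user="myuser",
    password="mypassword",
)
cur = conn.cursor()
cur.execute("SELECT version()")
print(cur.fetchone())
conn.close()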

Connect to a mysql source in Cloud Dataflow without requirements

Is there a package that is currently installed in the Python SDK that would allow me to connect to a MySQL source? If not, I'll need to add a requirements.txt file, which I'm trying to eliminate, as it drastically increases setup time.
Update: I suppose pandas can, though I believe it needs an additional 'binding' for each SQL source it connects to, if I'm not mistaken.
Since you are trying to connect to MySQL, you need a specific client library that will establish a channel between you and the database. Therefore, you will have to use the requirements.txt file to install this library.
You can refer to this Stack Overflow link that has a similar question. The answer specifies that "You must install a MySQL driver before doing anything. Unlike PHP, Only the SQLite driver is installed by default with Python. ...".
So only the SQLite driver is installed with the Python SDK, not the MySQL one.
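For example, assuming a pure-Python client such as PyMySQL is listed in requirements.txt (host and credentials below are placeholders), a worker could connect like this:

import pymysql

# Requires PyMySQL (or another MySQL client) installed via requirements.txt.
# Host and credentials are placeholders.
conn = pymysql.connect(
    host="my-mysql.example.com",
    user="myuser",
    password="mypassword",
    database="mydb",
)
with conn.cursor() as cur:
    cur.execute("SELECT VERSION()")
    print(cur.fetchone())
conn.close()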

How to set up a local development environment for PySpark ETL to run in AWS Glue?

PyCharm Professional supports connecting to, deploying to, and remotely debugging an AWS Glue development endpoint (https://docs.aws.amazon.com/glue/latest/dg/dev-endpoint-tutorial-pycharm.html), but I can't figure out how to use VS Code (my code editor of choice) for this purpose. Does VS Code support any of these functionalities? Or is there another free alternative to PyCharm Professional with the same capabilities?
I have not used PyCharm, but I have set up a local development endpoint with Zeppelin for my Glue job development and testing. Please see my related posts and references for setting up a local development endpoint. Maybe you can try it, and if it is useful, you can try to use PyCharm instead of Zeppelin.
References: Is it possible to use Jupyter Notebook for AWS Glue instead of Zeppelin, and the linked SO discussions on a Zeppelin local development endpoint.

How to create a working JDBC connection in Google Cloud Composer?

To get the JDBC Hook working, I first add in the jaydebeapi package in the PYPI packages page in Composer.
However, that alone does not allow a JDBC connection to work:
1) How do I specify the .jar driver path for the JDBC driver I have?
I was thinking it would be something like "/home/airflow/gcs/drivers/xxx.jar" (assuming I've created a drivers folder in the gcs directory)... but I haven't been able to verify or find documentation on this.
2) How do I install or point toward the Java JRE? On Ubuntu I run this command to install a JRE: sudo apt-get install default-jre libc6-i386. Is a JRE, or the ability to install one, available in Cloud Composer? This is the current error message I get in the Ad Hoc Query window with the JDBC connection: [Errno 2] No such file or directory: '/usr/lib/jvm'
If either of the above options are not currently available, are there any workarounds to get a JDBC connection working with Composer?
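For context, the connection I am hoping to end up with would look roughly like this (the driver class, JDBC URL, credentials, and jar filename are placeholders; it assumes a JRE is available and the driver jar is reachable under the GCS-mounted path):

import jaydebeapi

# Placeholders throughout: driver class, URL, credentials, and jar path are
# examples only, and this assumes a JRE is actually available in Composer.
conn = jaydebeapi.connect(
    "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "jdbc:sqlserver://my-host.example.com:1433;databaseName=mydb",
    ["myuser", "mypassword"],
    "/home/airflow/gcs/drivers/xxx.jar",
)
cur = conn.cursor()
cur.execute("SELECT 1")
print(cur.fetchall())
conn.close()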
There are known issues with JDBC in Airflow 1.9 (https://github.com/apache/incubator-airflow/pull/3257); hopefully, we should be able to backport these fixes into Composer by GA!