Is there an easier way than using Lambda/Boto to call a stored procedure in Redshift from AWS Glue?
I have everything set up in a Glue job and need to call a stored procedure in Redshift from the Spark script. I have a connection to Redshift made in Glue.
This related question does not answer it: Calling stored procedure from aws Glue Script
Please share any guidance on this.
Thank you.
You can do that using py4j and Java JDBC. The best part is that you do not even have to install anything, as Glue comes with JDBC connectors for many supported databases.
How to run arbitrary / DDL SQL statements or stored procedures using AWS Glue
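A rough sketch of that approach (not the linked answer's exact code): pull the JDBC settings from the Glue connection and call the procedure through the JVM's DriverManager via py4j. The connection name, database suffix, and procedure name below are placeholders.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

# Pull the JDBC settings from the Glue connection defined in the console
jdbc_conf = glue_context.extract_jdbc_conf("my-redshift-connection")

# Reach into the JVM via py4j and open a plain JDBC connection; the Redshift
# driver ships with Glue, so nothing extra has to be installed
conn = sc._gateway.jvm.java.sql.DriverManager.getConnection(
    jdbc_conf["url"] + "/dev",   # database name appended here as an assumption
    jdbc_conf["user"],
    jdbc_conf["password"],
)
try:
    stmt = conn.createStatement()
    stmt.execute("CALL my_schema.my_stored_procedure()")  # placeholder procedure
finally:
    conn.close()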
You can zip pg8000 as an additional library and use it to establish a connection to Redshift, then trigger the stored procedure.
It is not possible to trigger a stored procedure through the Spark JDBC data source.
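A minimal sketch of the pg8000 route, assuming pg8000 is zipped and attached to the job; the host, database, credentials and procedure name are placeholders.

import pg8000

conn = pg8000.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    database="dev",
    user="my_user",
    password="my_password",
)
cursor = conn.cursor()
cursor.execute("CALL my_schema.my_stored_procedure()")  # placeholder procedure
conn.commit()
cursor.close()
conn.close()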
We know that the procedure for writing from a PySpark script (AWS Glue job) to the AWS Glue Data Catalog is to write to an S3 bucket (e.g. CSV), then use a crawler and schedule it.
Is there any other way of writing to the AWS Glue Data Catalog?
I am looking for a direct way to do this, e.g. writing an S3 file and syncing it to the AWS Glue Data Catalog.
You may manually specify the table. The crawler only discovers the schema. If you set the schema manually, you should be able to read your data when you run the AWS Glue Job.
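A minimal sketch of specifying the table yourself with boto3 instead of running a crawler; the database, table, columns and S3 location are placeholders for a simple CSV layout.

import boto3

glue = boto3.client("glue")
glue.create_table(
    DatabaseName="my_database",          # assumed catalog database
    TableInput={
        "Name": "my_table",
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [
                {"Name": "id", "Type": "bigint"},
                {"Name": "event_date", "Type": "string"},
            ],
            "Location": "s3://my-bucket/my_table/",  # placeholder path
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)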
We had this same problem for one of our customers who had millions of small files in AWS S3. The crawler would practically stall, making no progress and running indefinitely. We came up with the following alternative approach:
1. A custom Glue Python Shell job was written which leveraged AWS Wrangler to fire queries towards AWS Athena.
2. The Python Shell job would list the contents of the folder s3:///event_date=<Put the Date Here from #2.1>.
3. The queries fired (a rough sketch follows this list):
   alter table add partition (event_date='<event_date from above>', eventname='<list derived from the above S3 List output>')
4. This was triggered to run after the main ingestion job via Glue Workflows.
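A rough sketch of steps 2 and 3, assuming AWS Wrangler (awswrangler) is available to the Python Shell job; the bucket, prefix, database and table names are placeholders.

import boto3
import awswrangler as wr

event_date = "2021-01-01"  # would come from the main ingestion job
s3 = boto3.client("s3")

# Step 2: list the eventname folders under the partition prefix
response = s3.list_objects_v2(
    Bucket="my-bucket",
    Prefix=f"events/event_date={event_date}/",
    Delimiter="/",
)
event_names = [
    p["Prefix"].rstrip("/").split("eventname=")[-1]
    for p in response.get("CommonPrefixes", [])
]

# Step 3: register each partition through Athena
for eventname in event_names:
    wr.athena.start_query_execution(
        sql=(
            f"ALTER TABLE my_table ADD IF NOT EXISTS PARTITION "
            f"(event_date='{event_date}', eventname='{eventname}')"
        ),
        database="my_database",
    )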
If you are not expecting the schema to change, create the tables manually using a Glue database and table, and then use the Glue job directly.
I'm trying to understand how to properly connect Redshift Spectrum with Hudi data.
It looks like I can directly create a Redshift external table for data managed in Apache Hudi, as described in this documentation: https://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html. The other way is to integrate Hudi with the AWS Glue Data Catalog, as mentioned here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-how-it-works.html, and then access the Hudi tables with Redshift Spectrum via the AWS Glue Data Catalog.
I have the same need for Apache Spark on AWS EMR: it looks like I can use Hudi either directly from EMR or via the AWS Glue Data Catalog.
Right now, I don't understand which way to choose. Could you please advise what the benefit is of using Hudi via the AWS Glue Data Catalog, or do I need to use it directly from Redshift Spectrum and AWS EMR?
Given that with Spark on EMR you need a catalog (a Hive metastore, if you will), using the AWS Glue Data Catalog is an option.
If you elect to use Glue as the metastore, then use it as the source for all data, unless errors are evident, in which case use the Hudi API for Spark.
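A minimal sketch of that setup, assuming the EMR cluster can be pointed at the Glue Data Catalog as its Hive metastore (this is normally configured at cluster creation; setting it on the SparkSession below is only illustrative). The database and table names are placeholders for a Hudi table already synced to the catalog.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-glue-catalog")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Point the Hive client at the Glue Data Catalog (usually set in EMR config)
    .config(
        "hive.metastore.client.factory.class",
        "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
    )
    .enableHiveSupport()
    .getOrCreate()
)

# Query a Hudi table that was synced into the Glue Data Catalog
df = spark.sql("SELECT * FROM my_database.my_hudi_table LIMIT 10")
df.show()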
I want to run a Glue job and have the stored procedure code start. I have a SQL stored procedure and want to call it from the autogenerated PySpark code for the Glue job. I don't want to use Redshift or Snowflake or anything like that if it can be helped. Any thoughts?
AWS Glue actually runs Python code. You can put your logic in that Python script and supply it to Glue in the AWS Console.
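A minimal sketch of where custom logic slots into the autogenerated Glue script; the helper function below is hypothetical and just marks the spot.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# --- custom logic can go here, before or after the generated transforms ---
def run_my_procedure():
    # placeholder for whatever database call or computation you need
    pass

run_my_procedure()

job.commit()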
I need to access some tables that are in AWS Glue, which I am using as a metastore. I wanted to know whether Glue provides a JDBC endpoint to connect to, just like Hive does.
I understand that it is possible to read data into AWS Glue from other databases like MySQL, Oracle, etc. using JDBC, but my requirement is the opposite: I have to read from AWS Glue using JDBC. Please help if it is possible, as I could not find a reference for this.
For accessing the data from the Glue catalog, follow these steps:
Run the crawler and update the table in the Glue catalog.
To access these tables using a JDBC or ODBC endpoint, you need Athena.
Download the driver from this link.
Read the docs for creating the URL according to your region here.
Also go through this documentation for additional properties.
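One way to exercise that Athena JDBC endpoint from Python (not part of the answer above) is JayDeBeApi with the downloaded Simba Athena driver jar; the jar path, region, result bucket, database and table below are placeholders.

import jaydebeapi

conn = jaydebeapi.connect(
    "com.simba.athena.jdbc.Driver",
    "jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://my-athena-results/",
    # Use the default AWS credentials chain instead of embedding keys
    {"AwsCredentialsProviderClass": "com.simba.athena.amazonaws.auth.DefaultAWSCredentialsProviderChain"},
    "/path/to/AthenaJDBC42.jar",
)
cursor = conn.cursor()
cursor.execute("SELECT * FROM my_database.my_table LIMIT 10")
print(cursor.fetchall())
conn.close()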
Hope it helps
I am new to AWS. I want to use AWS Glue for an ETL process.
Could we use AWS Glue to analyze an RDS database and store the analyzed data in an RDS MySQL table using an ETL job?
Thanks
Yes, it's possible. We have used S3 to store our raw data, from where we read the data in AWS Glue and perform UPSERTs to RDS Aurora as part of our ETL process. You can either use an AWS Glue trigger or a Lambda S3 event trigger to call the Glue job.
We have used pymysql / mysql.connector in AWS Glue since we have to do UPSERTs. Bulk loading data directly from S3 is also supported for RDS MySQL (Aurora). Let me know if you need help with a code sample.
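A minimal sketch of the pymysql UPSERT pattern, assuming pymysql is packaged with the Glue job; the endpoint, credentials, table and columns are placeholders.

import pymysql

conn = pymysql.connect(
    host="my-aurora-cluster.cluster-xxxxxxxx.us-east-1.rds.amazonaws.com",  # placeholder
    user="my_user",
    password="my_password",
    database="my_db",
)

# INSERT ... ON DUPLICATE KEY UPDATE gives UPSERT semantics on MySQL/Aurora
upsert_sql = """
    INSERT INTO my_table (id, name, updated_at)
    VALUES (%s, %s, %s)
    ON DUPLICATE KEY UPDATE name = VALUES(name), updated_at = VALUES(updated_at)
"""
rows = [(1, "alice", "2021-01-01"), (2, "bob", "2021-01-01")]  # e.g. collected from a DynamicFrame

with conn.cursor() as cursor:
    cursor.executemany(upsert_sql, rows)
conn.commit()
conn.close()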