I want to run a Glue job that kicks off a stored procedure. I have a SQL stored procedure and want to call it from the autogenerated PySpark code for the Glue job. I would rather not use Redshift or Snowflake or anything like that if it can be helped. Any thoughts?
AWS Glue actually runs Python code. You can put your logic in that Python code and supply it to Glue on the AWS Console.
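For example, if the database is MySQL-compatible (an assumption; swap in the driver for your engine), one minimal sketch is to ship a pure-Python driver such as PyMySQL to the job via --additional-python-modules and call the procedure from the script. All names below are placeholders:

```python
import pymysql  # supplied to the job, e.g. via --additional-python-modules pymysql

# Hypothetical connection details; in practice read these from a Glue connection
# or AWS Secrets Manager rather than hard-coding them.
conn = pymysql.connect(
    host="my-db.example.com",
    user="db_user",
    password="db_password",
    database="my_database",
)
try:
    with conn.cursor() as cursor:
        cursor.callproc("my_stored_procedure")  # hypothetical procedure name
    conn.commit()
finally:
    conn.close()
```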
We know that the usual procedure for getting data from a PySpark script (AWS Glue job) into the AWS Glue Data Catalog is to write it to an S3 bucket (e.g. as CSV), then run a crawler on a schedule.
Is there any other way of writing to the AWS Glue Data Catalog?
I am looking for a direct way to do this, e.g. writing the file to S3 and having it sync to the Glue Data Catalog.
You can specify the table manually. The crawler only discovers the schema; if you define the schema yourself, you should be able to read your data when you run the AWS Glue job.
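As a sketch of what "specify the table manually" can look like, you can define the table and its schema once through the Glue API (boto3 here) instead of running a crawler. The database, table, columns, location and CSV SerDe below are all placeholders:

```python
import boto3

glue = boto3.client("glue")

# Register the table and its schema once, instead of letting a crawler infer it.
glue.create_table(
    DatabaseName="my_database",          # hypothetical
    TableInput={
        "Name": "my_table",              # hypothetical
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "PartitionKeys": [{"Name": "event_date", "Type": "string"}],
        "StorageDescriptor": {
            "Columns": [
                {"Name": "id", "Type": "bigint"},
                {"Name": "payload", "Type": "string"},
            ],
            "Location": "s3://my-bucket/my_table/",  # hypothetical
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    },
)
```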
We had this same problem for one of our customers, who had millions of small files in AWS S3. The crawler would practically stall, making no progress while continuing to run indefinitely. We came up with the following alternative approach:
1. A custom Glue Python Shell job was written which leveraged AWS Wrangler to fire queries towards AWS Athena.
2. The Python Shell job would list the contents of the folder s3:///event_date=<Put the Date Here from #2.1>
3. The queries fired: alter table add partition (event_date='<event_date from above>', eventname='List derived from above S3 List output')
4. This was triggered to run after the main ingestion job via Glue Workflows.
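A minimal sketch of steps 2 and 3, assuming AWS Wrangler (awswrangler) is available to the Python Shell job; the bucket, database, table and date values below are placeholders:

```python
import awswrangler as wr

event_date = "2023-01-01"                                   # hypothetical value
prefix = f"s3://my-bucket/events/event_date={event_date}/"  # hypothetical path
database = "my_database"                                    # hypothetical
table = "my_events"                                         # hypothetical

# List the S3 objects for the given event_date and derive the eventname values.
keys = wr.s3.list_objects(prefix)
event_names = {k.split("eventname=")[1].split("/")[0] for k in keys if "eventname=" in k}

# Register each partition via an Athena DDL statement.
for event_name in sorted(event_names):
    wr.athena.start_query_execution(
        sql=(
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION "
            f"(event_date='{event_date}', eventname='{event_name}')"
        ),
        database=database,
    )
```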
If you are not expecting the schema to change, create the database and tables manually in the Glue Data Catalog and then use the Glue job directly.
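If you would rather skip the crawler entirely, Glue ETL jobs can also create or update the catalog table while writing to S3 via the sink's enableUpdateCatalog option. A minimal sketch, with all database, table and path names hypothetical:

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# dyf would normally be a DynamicFrame produced earlier in the job; we read one
# from the catalog here purely so the sketch is self-contained.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database", table_name="source_table")

sink = glue_context.getSink(
    connection_type="s3",
    path="s3://my-bucket/my_table/",   # hypothetical output location
    enableUpdateCatalog=True,          # create/update the catalog entry on write
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["event_date"],
)
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase="my_database", catalogTableName="my_table")
sink.writeFrame(dyf)
```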
I am trying to automate an ETL pipeline that outputs data from AWS RDS MySQL to AWS S3. I am currently using AWS Glue to do the job. When I do an initial load from RDS to S3, it captures all the data in the file, which is exactly what I want. However, when I add new data to the MySQL database and run the Glue job again, I get an empty file instead of the added rows. Any help would be MUCH appreciated.
The bookmarking rules for JDBC sources are covered in the AWS Glue developer guide (tracking processed data using job bookmarks). The important point to remember for JDBC sources is that the bookmark key values have to be in increasing or decreasing order, and Glue only processes data that is new since the last checkpoint.
Typically, either an auto-generated sequence number or a datetime column is used as the key for bookmarking.
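A minimal sketch of a bookmarked JDBC read (job bookmarks must also be enabled on the job itself via --job-bookmark-option); the database, table, key column and paths are placeholders:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from a catalog table backed by the MySQL connection. For bookmarks to
# work, the key column must be strictly increasing or decreasing, e.g. an
# auto-increment id or a created_at timestamp.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",           # hypothetical
    table_name="my_mysql_table",      # hypothetical
    transformation_ctx="read_mysql",  # bookmark state is tracked per transformation_ctx
    additional_options={
        "jobBookmarkKeys": ["id"],
        "jobBookmarkKeysSortOrder": "asc",
    },
)

glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/exports/"},  # hypothetical
    format="csv",
    transformation_ctx="write_s3",
)

job.commit()  # persists the bookmark so the next run starts from the new checkpoint
```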
For anybody who is still struggling with this (it drove me mad, because I thought my Spark code was wrong): disable bookmarking in the job details.
Hello, does anyone know how to write a Glue job script that runs over all tables in Glue?
I need help understanding how to list all databases and tables in Glue from a job.
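A minimal sketch using boto3 to walk the whole Data Catalog; the database and table names are whatever exists in your account:

```python
import boto3

glue = boto3.client("glue")

# Walk every database and every table registered in the Glue Data Catalog.
for db_page in glue.get_paginator("get_databases").paginate():
    for database in db_page["DatabaseList"]:
        db_name = database["Name"]
        for tbl_page in glue.get_paginator("get_tables").paginate(DatabaseName=db_name):
            for table in tbl_page["TableList"]:
                print(db_name, table["Name"])
                # In a Spark job you could then build a DynamicFrame per table:
                # glue_context.create_dynamic_frame.from_catalog(
                #     database=db_name, table_name=table["Name"])
```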
I am trying to create a job script using Java. In the AWS Glue console I can only find "Python, Spark", so does that mean we can't write the script in Java at all? If so, then what is this API used for: aws-java-sdk-glue
I even found an example: https://stackoverflow.com/questions/48256281/how-to-read-aws-glue-data-catalog-table-schemas-programmatically
From the above, it seems like we can write an AWS Glue script in Java too. Can anyone please confirm this?
EDIT:
In Scala, we write: glueContext.getCatalogSource(database = "my_data_base", tableName = "my_table")
In Java, I found the class below, which has the methods withDatabaseName and withTableName:
https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/glue/model/CatalogEntry.html
So, may I know what the purpose of the above class is?
The language option you see on the Glue console is for the script/code that you will write to extract, transform, and load the actual data that needs to be processed. The source can be a database or an S3 bucket, and the destination can be anything depending on your use case.
Normally you can create a Glue job or an S3 bucket from the AWS Management Console; when you don't want to do this manually, you need an SDK, which has the API call definitions you use to create AWS resources.
So the script inside a Glue job can be written only in Python or Scala, but when it comes to creating a Glue job you can use different languages/SDKs:
Java - https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/glue/AWSGlueClient.html
Python - https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue.html
JavaScript - https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Glue.html
Ruby - https://docs.aws.amazon.com/sdk-for-ruby/v3/api/Aws/Glue/Client.html
All of the above are SDKs used to define resources in AWS, whereas the link below has the actual code used inside a Glue job:
https://github.com/aws-samples/aws-glue-samples
Java is not supported for the actual script definition of AWS Glue jobs.
The API that you are referring to is the AWS SDK, which allows you to create and manage AWS Glue resources, such as creating/running crawlers, viewing and managing the Glue catalog, creating job definitions, etc.
So you can manage resources in the Glue service with the AWS SDK for Java, similar to how you manage resources in EC2, S3, and RDS with the AWS SDK for Java.
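As an illustration of the SDK-vs-script distinction, here is a minimal sketch using the Python SDK (boto3) to create and start a job definition; the job name, role ARN and script location are placeholders, and the script they point at would still be written in Python or Scala:

```python
import boto3

glue = boto3.client("glue")

# Create a job definition that points at a script stored in S3.
glue.create_job(
    Name="my-etl-job",                                    # hypothetical
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",  # hypothetical
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/etl.py",  # hypothetical
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)

# Kick off a run of that job.
glue.start_job_run(JobName="my-etl-job")
```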
Is there an easier way than using Lambda/Boto to call a stored procedure in Redshift from AWS Glue?
I have everything setup in a Glue job and need to call a stored procedure in Redshift from the Spark script. I have a connection made to Redshift in Glue.
This question does not have the answer: Calling stored procedure from aws Glue Script
Please share any guidance on this.
Thank you.
You can do that using py4j and Java JDBC. The best part is that you do not even have to install anything, as Glue comes with JDBC connectors for many supported databases. See:
How to run arbitrary / DDL SQL statements or stored procedures using AWS Glue
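A minimal sketch of that approach, assuming an existing Glue connection named "redshift-conn" and a stored procedure named my_stored_procedure (both hypothetical):

```python
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)

# Reuse the endpoint/credentials from the Glue connection instead of hard-coding them.
jdbc_conf = glue_context.extract_jdbc_conf("redshift-conn")  # hypothetical connection name

# Reach java.sql.DriverManager through py4j; the Redshift JDBC driver is already
# on the Glue job's classpath.
conn = sc._jvm.java.sql.DriverManager.getConnection(
    jdbc_conf.get("url") + "/dev",  # append the database name if the URL lacks it
    jdbc_conf.get("user"),
    jdbc_conf.get("password"),
)
try:
    stmt = conn.createStatement()
    stmt.execute("CALL my_stored_procedure();")  # hypothetical procedure
finally:
    conn.close()
```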
You can zip pg8000 as an additional library and use it to establish a connection to Redshift, then trigger the stored procedure.
It is not possible to trigger a stored procedure through the Spark JDBC data source.
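A minimal sketch of that route, assuming pg8000 has been supplied to the job (zipped as an additional library or via --additional-python-modules) and that all connection details and the procedure name below are placeholders:

```python
import pg8000  # supplied to the job as an additional library

# Hypothetical connection details; in practice read them from a Glue connection
# or AWS Secrets Manager.
conn = pg8000.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    database="dev",
    user="db_user",
    password="db_password",
)
conn.autocommit = True
try:
    cursor = conn.cursor()
    cursor.execute("CALL my_stored_procedure();")  # hypothetical procedure
    cursor.close()
finally:
    conn.close()
```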