Data Change Capture in Redshift using an AWS Glue script

I used a "for in" loop in an AWS Glue script to move 70 tables from S3 to Redshift, but when I run the script repeatedly the data gets duplicated. I found one document that looks like a solution for this:
https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/
However, since I am moving all the tables together inside the "for in" loop, how can I apply the staging-table concept described in that document?
Here is the script I am using to move the tables to Redshift:
client = boto3.client("glue", region_name="us-east-1")
databaseName = "db1_g"
Tables = client.get_tables(DatabaseName=databaseName)
tableList = Tables["TableList"]

for table in tableList:
    tableName = table["Name"]
    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database="db1_g", table_name=tableName, transformation_ctx="datasource0"
    )
    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=datasource0,
        catalog_connection="redshift",
        connection_options={
            "dbtable": f"schema1.{tableName}",
            "database": "db1",
        },
        redshift_tmp_dir=args["TempDir"],
        transformation_ctx="datasink4",
    )

job.commit()
Is there any way to avoid duplicating data when using a loop script like this to move the tables?
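For reference, the staging-table pattern from that knowledge-center article can be applied per table inside the same "for in" loop by passing preactions and postactions to write_dynamic_frame.from_jdbc_conf, much like the upsert answers further down. A minimal sketch, assuming each target table already exists in Redshift and has an id column to merge on (adjust the key columns per table):

for table in tableList:
    tableName = table["Name"]
    target_table = f"schema1.{tableName}"
    stage_table = f"schema1.{tableName}_stage"

    # Recreate an empty staging table with the same structure as the (existing) target
    pre_query = f"""
        drop table if exists {stage_table};
        create table {stage_table} as select * from {target_table} limit 0;"""

    # Merge staged rows into the target, then drop the staging table
    post_query = f"""
        begin;
        delete from {target_table} using {stage_table}
            where {stage_table}.id = {target_table}.id;
        insert into {target_table} select * from {stage_table};
        drop table {stage_table};
        end;"""

    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database="db1_g", table_name=tableName, transformation_ctx="datasource0"
    )
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=datasource0,
        catalog_connection="redshift",
        connection_options={
            "preactions": pre_query,
            "postactions": post_query,
            "dbtable": stage_table,  # load into the staging table, not the target
            "database": "db1",
        },
        redshift_tmp_dir=args["TempDir"],
        transformation_ctx="datasink4",
    )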

Related

AWS Glue Dynamic_frame with pushdown predicate not filtering correctly

I am writing a script for AWS Glue that reads parquet files stored in S3. I am creating a DynamicFrame and attempting to use push-down-predicate logic to restrict the data coming in.
The table partitions are (in order): account_id > region > vpc_id > dt
And the code for creating the dynamic_frame is the following:
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database=DATABASE_NAME,
    table_name=TABLE_NAME,
    push_down_predicate="dt='" + DATE + "'")
where DATE = '2019-10-29'
However, it seems that Glue still attempts to read data from other days. Maybe it's because I have to specify a push_down_predicate for the other partition columns as well?
As per the comments, the logs show that the date partition column appears as "dt" in S3, whereas in your catalog table it is referred to by the name "date".
Logs
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY/dt=2019-07-15
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY//dt=2019-10-03
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY//dt=2019-08-27
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY//dt=2019-10-29 ...
Your Code
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database=DATABASE_NAME,
    table_name=TABLE_NAME,
    push_down_predicate="date='" + DATE + "'")
Change the date partition column name to dt in your table, and use the same name in the push_down_predicate parameter in the code above.
I also see extra forward slashes in some of the paths in the logs above. Were these partitions added manually through Athena using the ALTER TABLE command? If so, I would recommend running MSCK REPAIR TABLE to load all partitions into the table and avoid such issues; extra blank slashes in an S3 path sometimes lead to errors while doing ETL through Spark.
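For completeness, a sketch of the corrected read, assuming the catalog partition column has been renamed to dt; the other partition columns can be combined into the same predicate if you also want to restrict on them (the account_id/region values below are placeholders):

DATE = "2019-10-29"
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database=DATABASE_NAME,
    table_name=TABLE_NAME,
    # All partition columns (account_id, region, vpc_id, dt) can appear in one predicate
    push_down_predicate=f"account_id='XXX' and region='eu-west-1' and dt='{DATE}'"
)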

How to partition data by datetime in AWS Glue?

The current set-up:
S3 location with json files. All files stored in the same location (no day/month/year structure).
Glue Crawler reads the data in a catalog table
Glue ETL job transforms and stores the data into parquet tables in s3
Glue Crawler reads from s3 parquet tables and stores into a new table that gets queried by Athena
What I want to achieve is (1) the parquet tables partitioned by day and (2) the parquet data for one day written to a single file. Currently there is a parquet file for each json file.
How would I go about it?
One thing to mention, there is a datetime column in the data, but it's a unix epoch timestamp. I would probably need to convert that to a 'year/month/day' format, otherwise I'm assuming it will create a partition for each file again.
Thanks a lot for your help!!
Convert Glue's DynamicFrame into Spark's DataFrame to add year/month/day columns and repartition. Reducing partitions to one will ensure that only one file will be written into a folder, but it may slow down job performance.
Here is the Python code:
from pyspark.sql.functions import col, year, month, dayofmonth, to_date, from_unixtime
...
df = dynamicFrameSrc.toDF()

repartitioned_with_new_columns_df = (
    df
    .withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
    .withColumn("year", year(col("date_col")))
    .withColumn("month", month(col("date_col")))
    .withColumn("day", dayofmonth(col("date_col")))
    .drop(col("date_col"))
    .repartition(1)
)

dyf = DynamicFrame.fromDF(repartitioned_with_new_columns_df, glueContext, "enriched")

datasink = glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://yourbucket/data",
        "partitionKeys": ["year", "month", "day"]
    },
    format="parquet",
    transformation_ctx="datasink"
)
Note that the from pyspark.sql.functions import col can give a reference error in some editors; this shouldn't be a problem, as explained here.
I cannot comment, so I am going to write this as an answer.
I used Yuriy's code and a couple of things needed adjustment:
missing brackets
df = dynamicFrameSrc.toDF()
after toDF() I had to add select("*"), otherwise the schema was empty
df.select("*")
    .withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
To achieve this in AWS Glue Studio:
You will need to write a custom transform function to convert the datetime field to a date. There is the extra step of converting it back to a DynamicFrameCollection.
In Python:
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0]).toDF()
    df_with_date = df.withColumn('date_field', df['datetime_field'].cast('date'))
    glue_df = DynamicFrame.fromDF(df_with_date, glueContext, "transform_date")
    return DynamicFrameCollection({"CustomTransform0": glue_df}, glueContext)
You would then have to edit the custom transformer schema to include that new date field you just created.
You can then use the "data target" node to write the data to disk and then select that new date field to use as a partition.
video step by step walkthrough
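Outside Glue Studio, the equivalent write step in a plain Glue script might look like the sketch below, which just reuses the write_dynamic_frame.from_options pattern from the earlier answer; the bucket path is a placeholder and glue_df is the DynamicFrame produced inside MyTransform:

datasink = glueContext.write_dynamic_frame.from_options(
    frame=glue_df,
    connection_type="s3",
    connection_options={
        "path": "s3://yourbucket/data",      # placeholder bucket/prefix
        "partitionKeys": ["date_field"]      # partition by the newly created date column
    },
    format="parquet",
    transformation_ctx="datasink"
)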

How do I query a JDBC database within AWS Glue using a WHERE clause with PySpark?

I have a self-authored Glue script and a JDBC connection stored in the Glue catalog. I cannot figure out how to use PySpark to run a select statement against the MySQL database in RDS that my JDBC connection points to. I have also used a Glue crawler to infer the schema of the RDS table I am interested in querying. How do I query the RDS database using a WHERE clause?
I have looked through the documentation for DynamicFrameReader and the GlueContext Class but neither seem to point me in the direction that I am seeking.
It depends on what you want to do. For example, if you want to do a select * from table where <conditions>, there are two options:
Assuming you created a crawler and added the source to your AWS Glue job like this:
# Read data from database
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db", table_name = "students", redshift_tmp_dir = args["TempDir"])
AWS Glue
# Select the needed fields
selectfields1 = SelectFields.apply(frame = datasource0, paths = ["user_id", "full_name", "is_active", "org_id", "org_name", "institution_id", "department_id"], transformation_ctx = "selectfields1")
filter2 = Filter.apply(frame = selectfields1, f = lambda x: x["org_id"] in org_ids, transformation_ctx="filter2")
PySpark + AWS Glue
# Change DynamicFrame to Spark DataFrame
dataframe = DynamicFrame.toDF(datasource0)
# Create a view
dataframe.createOrReplaceTempView("students")
# Use Spark SQL to select the fields
dataframe_sql_df_dim = spark.sql(
    "SELECT user_id, full_name, is_active, org_id, org_name, institution_id, department_id "
    "FROM students WHERE org_id IN (" + org_ids + ")")
# Change back to DynamicFrame
selectfields = DynamicFrame.fromDF(dataframe_sql_df_dim, glueContext, "selectfields2")
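If you would rather push the WHERE clause down to MySQL itself instead of filtering after the read, plain Spark JDBC (bypassing the Glue catalog) accepts a subquery as dbtable. This is only a sketch; the URL, credentials, driver class, and column names are placeholders you would normally take from your Glue connection:

# Placeholders: replace with the values from your JDBC connection
jdbc_url = "jdbc:mysql://your-rds-endpoint:3306/your_db"
pushdown_query = "(SELECT user_id, full_name, org_id FROM students WHERE is_active = 1) AS students_filtered"

filtered_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", pushdown_query)          # the WHERE clause runs inside MySQL
    .option("user", "your_user")
    .option("password", "your_password")
    .option("driver", "com.mysql.jdbc.Driver")  # driver class depends on the connector version
    .load()
)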

Is it possible to write a Partitioned DataFrame into S3 bucket?

I have to write a Spark DataFrame into an S3 bucket, and it should create a separate parquet file for each partition.
Here is my code:
dynamicDataFrame = DynamicFrame.fromDF(
    testDataFrame, glueContext,
    "dynamicDataFrame")

glueContext.write_dynamic_frame.from_options(
    frame=dynamicDataFrame,
    connection_type="s3",
    connection_options={
        "path": "s3://BUCKET_NAME/DIR_NAME",
        "partitionKeys": ["COL_NAME"]
    },
    format="parquet"
)
When I specify the "partitionKeys": ["COL_NAME"] option, the Glue job executes without any error but it does not create any file in S3.
And when I remove this "partitionKeys" option, it creates 200 parquet files in S3 (the default number of partitions is 200).
But I want to create partitions on the basis of a particular column.
So, is it possible to create partition-wise parquet files in S3 while writing a DataFrame to S3?
Note: I am using AWS resources i.e. AWS Glue.
Are you sure the partition column has data?
Do you find anything in the Glue logs?
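One quick way to check what the comment suggests before writing (a sketch; testDataFrame and COL_NAME are the names from the question):

from pyspark.sql.functions import col

# Count rows where the partition column is populated; if this is 0,
# that would explain why no partitioned output appears in S3
non_null_rows = testDataFrame.filter(col("COL_NAME").isNotNull()).count()
print("rows with a non-null partition value:", non_null_rows)

# Also worth a look: the distinct values that would become S3 prefixes
testDataFrame.select("COL_NAME").distinct().show(20, truncate=False)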

Upsert from AWS Glue to Amazon Redshift

I understand that there is no direct UPSERT query one can perform directly from Glue to Redshift. Is it possible to implement the staging table concept within the glue script itself?
So my expectation is to create the staging table, merge it with the destination table, and finally delete it. Can this be achieved within the Glue script?
It is possible to implement an upsert into Redshift using a staging table in Glue by passing the 'postactions' option to the JDBC sink:
val destinationTable = "upsert_test"
val destination = s"dev_sandbox.${destinationTable}"
val staging = s"dev_sandbox.${destinationTable}_staging"
val fields = datasetDf.toDF().columns.mkString(",")

val postActions =
  s"""
     DELETE FROM $destination USING $staging AS S
       WHERE $destinationTable.id = S.id
         AND $destinationTable.date = S.date;
     INSERT INTO $destination ($fields) SELECT $fields FROM $staging;
     DROP TABLE IF EXISTS $staging
  """

// Write data to staging table in Redshift
glueContext.getJDBCSink(
  catalogConnection = "redshift-glue-connections-test",
  options = JsonOptions(Map(
    "database" -> "conndb",
    "dbtable" -> staging,
    "overwrite" -> "true",
    "postactions" -> postActions
  )),
  redshiftTmpDir = s"$tempDir/redshift",
  transformationContext = "redshift-output"
).writeDynamicFrame(datasetDf)
Make sure the user used for writing to Redshift has sufficient permissions to create/drop tables in the staging schema.
Apparently the connection_options dictionary parameter of the glueContext.write_dynamic_frame.from_jdbc_conf function has two interesting keys: preactions and postactions.
target_table = "my_schema.my_table"
stage_table = "my_schema.#my_table_stage_table"

pre_query = """
    drop table if exists {stage_table};
    create table {stage_table} as select * from {target_table} LIMIT 0;""".format(
    stage_table=stage_table, target_table=target_table)

post_query = """
    begin;
    delete from {target_table} using {stage_table} where {stage_table}.id = {target_table}.id;
    insert into {target_table} select * from {stage_table};
    drop table {stage_table};
    end;""".format(stage_table=stage_table, target_table=target_table)

datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=datasource0,
    catalog_connection="test_red",
    redshift_tmp_dir='s3://s3path',
    transformation_ctx="datasink4",
    connection_options={"preactions": pre_query, "postactions": post_query,
                        "dbtable": stage_table, "database": "redshiftdb"})
Based on https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/
Yes, it is totally achievable. All you would need is to import the pg8000 module into your Glue job. pg8000 is the Python library used to make a connection to Amazon Redshift and execute SQL queries through a cursor.
Python module reference: https://github.com/mfenniak/pg8000
Then make a connection to your target cluster with pg8000.connect(user='user', database='dbname', host='hosturl', port=5439, password='urpasswrd').
Use Glue's datasink option to load into the staging table, and then run the upsert SQL query using a pg8000 cursor:
>>> import pg8000
>>> conn = pg8000.connect(user='user', database='dbname', host='hosturl', port=5439, password='urpasswrd')
>>> cursor = conn.cursor()
>>> cursor.execute("CREATE TEMPORARY TABLE book (id SERIAL, title TEXT)")
>>> cursor.execute("INSERT INTO final_target SELECT * FROM book")
>>> conn.commit()
You would need to zip the pg8000 package, put it in an S3 bucket, and reference it in the Python library path under Advanced properties / Job parameters in the Glue job section.