When using Relationalize in Glue there is no id in root table

I have a DynamicFrame in Glue and I am using the Relationalize method, which creates three new dynamic frames: root_table, root_table_1 and root_table_2.
When I print the schema of the tables, or after I insert the tables into the database, I notice that the id is missing from root_table, so I cannot make joins between root_table and the other tables.
I tried all the possible combinations.
Is there something I am missing?
datasource1 = Relationalize.apply(frame = renameId, name = "root_ds", transformation_ctx = "datasource1")
print(datasource1.keys())
print(datasource1.values())

for df_name in datasource1.keys():
    m_df = datasource1.select(df_name)
    print("Writing to Redshift table: " + df_name)
    m_df.printSchema()
    glueContext.write_dynamic_frame.from_jdbc_conf(frame = m_df, catalog_connection = "Redshift", connection_options = {"database" : "redshift", "dbtable" : df_name}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "df_to_db")

I used the code below (omitting the import bits) on your data and wrote the output to S3. I got two files, pasted after the code. I am reading from the Glue catalog after running the crawler on your data.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "json_aws_glue_relationalize_stackoverflow", transformation_ctx = "datasource0")
dfc = datasource0.relationalize("advertise_root", "s3://aws-glue-temporary-009551040880-ap-southeast-2/")

for df_name in dfc.keys():
    m_df = dfc.select(df_name)
    print("Writing to S3 file: " + df_name)
    datasink2 = glueContext.write_dynamic_frame.from_options(frame = m_df, connection_type = "s3", connection_options = {"path": "s3://aws-glue-relationalize-stackoverflow/" + df_name + "/"}, format = "csv", transformation_ctx = "datasink2")
job.commit()
Main table:
advertiserCountry,advertiserId,amendReason,amended,clickDate,clickDevice,clickRefs.clickRef2,clickRefs.clickRef6,commissionAmount.amount,"commissionAmount.currency","commissionSharingPublisherId",commissionStatus,customParameters,customerCountry,declineReason,id,ipHash,lapseTime,oldCommissionAmount,oldSaleAmount,orderRef,originalSaleAmount,paidToPublisher,paymentId,publisherId,publisherUrl,saleAmount.amount,saleAmount.currency,siteName,transactionDate,transactionDevice,transactionParts,transactionQueryId,type,url,validationDate,voucherCode,voucherCodeUsed,partition_0
AT,123456,,false,2018-09-05T16:31:00,iPhone,"asdsdedrfrgthyjukiloujhrdf45654565423212",www.website.at,1.5,EUR,,pending,,AT,,321547896,-27670654789123380,68,,,,,false,0,654987,,1.0,EUR,https://www.site.at,2018-09-05T16:32:00,iPhone,1,0,Lead,https://www.website.at,,,false,advertise
Another table, for the transaction parts:
id,index,"transactionParts.val.amount","transactionParts.val.commissionAmount","transactionParts.val.commissionGroupCode","transactionParts.val.commissionGroupId","transactionParts.val.commissionGroupName"
1,0,1.0,1.5,LEAD,654654,Lead
Glue generated a primary key column named "transactionParts" in the base table, and the id in the transactionParts table is the foreign key to that column. As you can see, it preserved the original id column as it is.
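For illustration, here is a minimal sketch of joining the two relationalized frames back together on that generated key. The child frame name (advertise_root_transactionParts) and the join columns are assumptions based on the output above; check dfc.keys() for the actual names in your run.
# Sketch: join the root frame to the transactionParts child frame (names are assumptions).
root_df = dfc.select("advertise_root").toDF()
parts_df = dfc.select("advertise_root_transactionParts").toDF()
# "transactionParts" in the root holds the generated key; "id" in the child is the foreign key to it.
joined_df = root_df.join(parts_df, root_df["transactionParts"] == parts_df["id"], "left")
joined_df.show(5)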
Can you please try the code on your data and see if it works (changing the source table name to match yours)? Try writing to S3 as CSV first to figure out whether that's working. Please let me know your findings.

Related

Use of ResolveChoice in Glue

I was able to create a small Glue job to ingest data from one S3 bucket into another, but I am not clear about the last few lines in the code (below).
applymapping1 = ApplyMapping.apply(frame = datasource_lk, mappings = [("row_id", "bigint", "row_id", "bigint"), ("Quantity", "long", "Quantity", "long"),("Category", "string", "Category", "string") ], transformation_ctx = "applymapping1")
selectfields2 = SelectFields.apply(frame = applymapping1, paths = ["row_id", "Quantity", "Category"], transformation_ctx = "selectfields2")
resolvechoice3 = ResolveChoice.apply(frame = selectfields2, choice = "MATCH_CATALOG", database = "mydb", table_name = "order_summary_csv", transformation_ctx = "resolvechoice3")
datasink4 = glueContext.write_dynamic_frame.from_catalog(frame = resolvechoice3, database = "mydb", table_name = "order_summary_csv", transformation_ctx = "datasink4")
job.commit()
From the above code snippet, what is the use of 'ResolveChoice'? Is it mandatory?
When I ran this job, it created a new folder and file (with some random file name) in the destination (order_summary.csv) and ingested the data there, instead of ingesting directly into my order_summary_csv table (a CSV file) residing in the S3 folder. Is it possible for Spark (Glue) to ingest data into a specific CSV file?
I think this ResolveChoice.apply method call is out of date, since there is no such choice as "MATCH_CATALOG" in the doc:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-ResolveChoice.html
The general idea behind ResolveChoice is that if you have a field with both int values and string values, you should resolve how to handle this field (each option is sketched below):
Cast it to Int
Cast it to String
Leave both and create 2 columns in the result dataset
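A minimal sketch of those three options (the frame dyf and the column price are hypothetical):
# Hypothetical DynamicFrame `dyf` with a column "price" holding both int and string values
casted_to_int = dyf.resolveChoice(specs = [("price", "cast:int")])        # 1. cast everything to int
casted_to_string = dyf.resolveChoice(specs = [("price", "cast:string")])  # 2. cast everything to string
both_columns = dyf.resolveChoice(specs = [("price", "make_cols")])        # 3. keep both: creates price_int and price_string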
- You can't write a Glue DynamicFrame/DataFrame to CSV format with a specific file name, because in the backend Spark writes it with a random partition name.
- ResolveChoice is useful when your dynamic frame has a column whose records have different datatypes. Unlike a Spark DataFrame, a Glue DynamicFrame doesn't fall back to string as the default datatype; it retains both datatypes. Using resolveChoice you can choose which datatype the column should ideally have, and records with the other datatype will be set to null.

Glue job is deleting columns when creating multiple partitions

My Glue job reads a table (an S3 CSV file), then partitions it and writes 10 JSON files to S3.
I noticed that for some lines in the resulting files, some columns are gone!
This is the code:
etalab_named_postgre_csv = glueContext.create_dynamic_frame.from_catalog(database = "db", table_name = "tab", transformation_ctx = "datasource0")
applymapping_etalab_named_postgre_csv= ApplyMapping.apply(frame = etalab_named_postgre_csv, mappings = [("compldistrib", "string", "compldistrib", "string"), ("numvoie", "long", "numvoie", "long"),....], transformation_ctx = "applymapping1")
path_s3 = "s3://Bucket"
etalab_named_postgre_csv = applymapping_etalab_named_postgre_csv.toDF()
etalab_named_postgre_csv.repartition(10).write.format("json").option("sep",",").option("header", "true").option("mode","Overwrite").save(path_s3)
In the output files, some of the columns just disappear!
I used Spark on EMR to load the same input table and checked that the columns that disappeared do exist in the source.
Is this common Glue behaviour? How can I prevent it, please?
EDIT:
I am now sure of the problem.
It seems that the Glue mapping is the source of the problem. When I do
applymapping_etalab_named_postgre_csv= ApplyMapping.apply(frame = etalab_named_postgre_csv, mappings = [("compldistrib", "string", "compldistrib", "string"), ("numvoie", "long", "numvoie", "long"),....], transformation_ctx = "applymapping1")
I am declaring that compldistrib is a string and that I want it as a string in the output. If a row contains a numeric value in compldistrib, the mapping just ignores it!
Is this a bug?
So after hours of searching I didn't find a solution. The alternative I found is replacing the Glue job with a Spark job on EMR. It is also a lot quicker.
I hope this helps someone.
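If you do want to stay in Glue, one possible mitigation (a sketch only, based on the ResolveChoice behaviour described in the previous question, and not verified against this dataset) is to resolve the mixed-type column to a single type before applying the mapping, so numeric values in compldistrib are cast rather than dropped:
# Sketch: force compldistrib to a single (string) type before ApplyMapping.
resolved = etalab_named_postgre_csv.resolveChoice(specs = [("compldistrib", "cast:string")])
applymapping_etalab_named_postgre_csv = ApplyMapping.apply(
    frame = resolved,
    mappings = [("compldistrib", "string", "compldistrib", "string"),
                ("numvoie", "long", "numvoie", "long")],  # remaining mappings omitted, as in the question
    transformation_ctx = "applymapping1")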

AWS push down predicate not working when reading HIVE partitions

I am trying to test out some Glue functionality, and the push down predicate is not working on Avro files within S3 that were partitioned for use in Hive. Our partitions are as follows: YYYY-MM-DD.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
filterpred = "loaddate == '2019-08-08'"
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "hive",
    table_name = "stuff",
    pushDownPredicate = filterpred)
print('############################################')
print("COUNT: ", datasource0.count())
print('##############################################')
df = datasource0.toDF()
df.show(5)
job.commit()
However, I still see Glue pulling in dates way outside of the range:
Opening 's3://data/2018-11-29/part-00000-a58ee9cb-c82c-46e6-9657-85b4ead2927d-c000.avro' for reading
2019-09-13 13:47:47,071 INFO [Executor task launch worker for task 258] s3n.S3NativeFileSystem (S3NativeFileSystem.java:open(1208)) -
Opening 's3://data/2017-09-28/part-00000-53c07db9-05d7-4032-aa73-01e239f509cf.avro' for reading
I tried using the examples in the following:
AWS Glue DynamicFrames and Push Down Predicate
AWS Glue pushdown predicate not working properly
Currently none of the proposed solutions works for me. I tried adding the partition column (loaddate), taking it out, quoting, unquoting, etc. It still grabs data outside of the date range.
There is an error in your code: the correct parameter to pass to the from_catalog function is "push_down_predicate", not "pushDownPredicate".
Sample snippet:
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "hive",
    table_name = "stuff",
    push_down_predicate = filterpred)
Reference (AWS documentation): https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html
It seems your partitions are not in Hive naming style (key=value), so you have to use the default name partition_0 in the query. Also, as suggested in another answer, the parameter is called push_down_predicate:
filterpred = "partition_0 == '2019-08-08'"
datasource0 = glue_context.create_dynamic_frame.from_catalog(
database = "hive",
table_name = "stuff",
push_down_predicate = filterpred)
Make sure your data is partitioned properly and run the Glue crawler to create the partitioned table.
Run a query in Athena to repair your table:
MSCK REPAIR TABLE tbl;
Run a query in Athena to check the partitions:
SHOW PARTITIONS tbl;
In Scala you can use the following code.
Without predicate:
val datasource0 = glueContext.getCatalogSource(database = "ny_taxi_db", tableName = "taxi_tbl", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
datasource0.toDF().count()
With predicate:
val predicate = "(year == '2016' and year_month == '201601' and year_month_day == '20160114')"
val datasource1 = glueContext.getCatalogSource(database = "ny_taxi_db", tableName = "taxi_tbl", transformationContext = "datasource1", pushDownPredicate = predicate).getDynamicFrame()
datasource1.toDF().count()
In Python you can use the following code.
Without predicate:
ds = glueContext.create_dynamic_frame.from_catalog(database = "ny_taxi_db", table_name = "taxi_data_by_vender", transformation_ctx = "datasource0")
ds.toDF().count()
With predicate:
ds1 = glueContext.create_dynamic_frame.from_catalog(database = "ny_taxi_db" , table_name = "taxi_data_by_vender", transformation_ctx = "datasource1" , push_down_predicate = "(vendorid == 1)")
ds1.toDF().count()

How do I query a JDBC database within AWS Glue using a WHERE clause with PySpark?

I have a self-authored Glue script and a JDBC connection stored in the Glue catalog. I cannot figure out how to use PySpark to run a select statement against the MySQL database in RDS that my JDBC connection points to. I have also used a Glue crawler to infer the schema of the RDS table that I am interested in querying. How do I query the RDS database using a WHERE clause?
I have looked through the documentation for DynamicFrameReader and the GlueContext Class but neither seem to point me in the direction that I am seeking.
It depends on what you want to do. For example, if you want to do a select * from table where <conditions>, there are two options:
Assuming you created a crawler and added the source to your AWS Glue job like this:
# Read data from database
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db", table_name = "students", redshift_tmp_dir = args["TempDir"])
AWS Glue
# Select the needed fields
selectfields1 = SelectFields.apply(frame = datasource0, paths = ["user_id", "full_name", "is_active", "org_id", "org_name", "institution_id", "department_id"], transformation_ctx = "selectfields1")
filter2 = Filter.apply(frame = selectfields1, f = lambda x: x["org_id"] in org_ids, transformation_ctx="filter2")
PySpark + AWS Glue
# Import DynamicFrame so we can convert between DynamicFrames and DataFrames
from awsglue.dynamicframe import DynamicFrame

# Change DynamicFrame to Spark DataFrame
dataframe = DynamicFrame.toDF(datasource0)
# Create a view
dataframe.createOrReplaceTempView("students")
# Use SparkSQL to select the fields
dataframe_sql_df_dim = spark.sql("SELECT user_id, full_name, is_active, org_id, org_name, institution_id, department_id FROM students WHERE org_id in (" + org_ids + ")")
# Change back to DynamicFrame
selectfields = DynamicFrame.fromDF(dataframe_sql_df_dim, glueContext, "selectfields2")

Upsert from AWS Glue to Amazon Redshift

I understand that there is no UPSERT query one can perform directly from Glue to Redshift. Is it possible to implement the staging-table concept within the Glue script itself?
So my expectation is to create the staging table, merge it with the destination table, and finally delete it. Can this be achieved within the Glue script?
It is possible to implement an upsert into Redshift using a staging table in Glue by passing the 'postactions' option to the JDBC sink:
val destinationTable = "upsert_test"
val destination = s"dev_sandbox.${destinationTable}"
val staging = s"dev_sandbox.${destinationTable}_staging"
val fields = datasetDf.toDF().columns.mkString(",")
val postActions =
s"""
DELETE FROM $destination USING $staging AS S
WHERE $destinationTable.id = S.id
AND $destinationTable.date = S.date;
INSERT INTO $destination ($fields) SELECT $fields FROM $staging;
DROP TABLE IF EXISTS $staging
"""
// Write data to staging table in Redshift
glueContext.getJDBCSink(
  catalogConnection = "redshift-glue-connections-test",
  options = JsonOptions(Map(
    "database" -> "conndb",
    "dbtable" -> staging,
    "overwrite" -> "true",
    "postactions" -> postActions
  )),
  redshiftTmpDir = s"$tempDir/redshift",
  transformationContext = "redshift-output"
).writeDynamicFrame(datasetDf)
Make sure the user used for writing to Redshift has sufficient permissions to create/drop tables in the staging schema.
Apparently the connection_options dictionary parameter of the glueContext.write_dynamic_frame.from_jdbc_conf function has two interesting keys: preactions and postactions.
target_table = "my_schema.my_table"
stage_table = "my_schema.#my_table_stage_table"
pre_query = """
drop table if exists {stage_table};
create table {stage_table} as select * from {target_table} LIMIT 0;""".format(stage_table=stage_table, target_table=target_table)
post_query = """
begin;
delete from {target_table} using {stage_table} where {stage_table}.id = {target_table}.id ;
insert into {target_table} select * from {stage_table};
drop table {stage_table};
end;""".format(stage_table=stage_table, target_table=target_table)
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = datasource0, catalog_connection = "test_red",
    redshift_tmp_dir = 's3://s3path', transformation_ctx = "datasink4",
    connection_options = {"preactions": pre_query, "postactions": post_query,
                          "dbtable": stage_table, "database": "redshiftdb"})
Based on https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/
Yes, it is totally achievable. All you would need is to import the pg8000 module into your Glue job. pg8000 is the Python library used to make a connection with Amazon Redshift and execute SQL queries through a cursor.
Python module reference: https://github.com/mfenniak/pg8000
Then make a connection to your target cluster through pg8000.connect(user='user', database='dbname', host='hosturl', port=5439, password='urpasswrd').
Use Glue's datasink option to load into the staging table and then run the upsert SQL query using the pg8000 cursor (a combined sketch follows at the end of this answer).
>>> import pg8000
>>> conn = pg8000.connect(user='user',database='dbname',host='hosturl',port=5439,password='urpasswrd')
>>> cursor = conn.cursor()
>>> cursor.execute("CREATE TEMPORARY TABLE book (id SERIAL, title TEXT)")
>>> cursor.execute("INSERT INTO TABLE final_target"))
>>> conn.commit()
You would need to zip the pg8000 package, put it in an S3 bucket, and reference it in the Python library path under Advanced options / Job parameters in the Glue job section.
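Putting those pieces together, here is a minimal sketch of that flow. The connection name, database, schema, table names and credentials are placeholders, and it assumes the staging table is created (or overwritten) by the sink and that both tables share an id column:
import pg8000

# 1. Load the incoming DynamicFrame into the staging table via the Glue sink (placeholder names).
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame = datasource0,
    catalog_connection = "Redshift",
    connection_options = {"database": "redshiftdb", "dbtable": "my_schema.my_table_staging"},
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "stage_sink")

# 2. Run the upsert with a pg8000 cursor and clean up the staging table.
conn = pg8000.connect(user='user', database='dbname', host='hosturl', port=5439, password='urpasswrd')
cursor = conn.cursor()
for statement in [
        "DELETE FROM my_schema.my_table USING my_schema.my_table_staging s WHERE my_schema.my_table.id = s.id",
        "INSERT INTO my_schema.my_table SELECT * FROM my_schema.my_table_staging",
        "DROP TABLE my_schema.my_table_staging"]:
    cursor.execute(statement)
conn.commit()
conn.close()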