How to iterate through a Glue DynamicFrame - amazon-web-services

Hi, I am working with AWS Glue and Spark. I am reading data from a DynamoDB table and creating a DynamicFrame from it. I want to send all of the data from that table, record by record, to SQS. I have seen a suggestion to convert the DynamicFrame to a Spark DataFrame first, but this table has millions of records and the conversion could take a while. I want to send every record in the DynamicFrame straight to the SQS queue.
Here is my code:
import sys
import boto3
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# account_id is assumed to be defined elsewhere (e.g. passed in as a job argument)
sqs = boto3.resource('sqs')
sqs_queue_url = f"https://sqs.us-east-1.amazonaws.com/{account_id}/my-stream-queue"
queue = sqs.Queue(sqs_queue_url)

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
job = Job(glueContext)
## #params: [JOB_NAME]
job.init(args['JOB_NAME'], args)
logger = glueContext.get_logger()

df = glueContext.create_dynamic_frame.from_options(
    "dynamodb",
    connection_options={
        "dynamodb.input.tableName": "my_table",
        "dynamodb.throughput.read.percent": "1.5",
        "dynamodb.splits": "500"
    },
    numSlots=2368)

job.commit()

# iterate over the dynamic frame and send each record to the sqs queue
for record in df:
    queue.send_message(MessageBody=record)

I am doing something very similar. Here is what I discovered:
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="athena",
    table_name=str(args['value']),
    transformation_ctx="datasource0")
job.commit()

df = datasource0.toDF()
pandasDF = df.toPandas()
for index, row in pandasDF.iterrows():
    # generate_message and send_message are helper functions defined elsewhere
    message_body = generate_message(row['bucket'], row['key'], row['version_id'])
    send_message(sqs_queue, json.loads(json.dumps(message_body)))
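If collecting everything to pandas on the driver is a concern at millions of rows (toPandas() pulls the whole table into driver memory), one alternative, shown here as an untested sketch with a placeholder queue URL, is to convert only to a Spark DataFrame and send messages from the executors with foreachPartition, batching calls with SQS send_message_batch:

def send_partition(rows):
    # each executor builds its own boto3 client; clients are not serializable
    import json
    import boto3
    sqs = boto3.client('sqs', region_name='us-east-1')
    queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-stream-queue"  # placeholder
    batch = []
    for row in rows:
        batch.append({"Id": str(len(batch)), "MessageBody": json.dumps(row.asDict(), default=str)})
        if len(batch) == 10:  # SQS allows at most 10 messages per batch
            sqs.send_message_batch(QueueUrl=queue_url, Entries=batch)
            batch = []
    if batch:
        sqs.send_message_batch(QueueUrl=queue_url, Entries=batch)

# nothing is collected to the driver; records are sent partition by partition on the executors
datasource0.toDF().foreachPartition(send_partition)

toDF() is still a conversion, but it avoids materializing the table as a pandas DataFrame on one machine.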

Related

Use of ResolveChoice in Glue

I was able to create a small Glue job to ingest data from one S3 bucket into another, but I am not clear about the last few lines of the code (below).
applymapping1 = ApplyMapping.apply(frame = datasource_lk, mappings = [("row_id", "bigint", "row_id", "bigint"), ("Quantity", "long", "Quantity", "long"),("Category", "string", "Category", "string") ], transformation_ctx = "applymapping1")
selectfields2 = SelectFields.apply(frame = applymapping1, paths = ["row_id", "Quantity", "Category"], transformation_ctx = "selectfields2")
resolvechoice3 = ResolveChoice.apply(frame = selectfields2, choice = "MATCH_CATALOG", database = "mydb", table_name = "order_summary_csv", transformation_ctx = "resolvechoice3")
datasink4 = glueContext.write_dynamic_frame.from_catalog(frame = resolvechoice3, database = "mydb", table_name = "order_summary_csv", transformation_ctx = "datasink4")
job.commit()
From the above code snippet, what is the use of ResolveChoice? Is it mandatory?
When I ran this job, it created a new folder and a file (with a random name) in the destination (order_summary.csv) and ingested the data there, instead of ingesting directly into my order_summary_csv table (a CSV file) residing in the S3 folder. Is it possible for Spark (Glue) to ingest data into a specific CSV file?
I think this ResolveChoice.apply call is out of date, since there is no choice such as "MATCH_CATALOG" in the documentation:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-ResolveChoice.html
The general idea behind ResolveChoice is that if you have a field that contains both int and string values, you have to resolve how that field is handled (a short sketch of the options follows this list). You can:
Cast it to int
Cast it to string
Keep both and create two columns in the result dataset
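For illustration, here is a minimal, untested sketch of those three options using ResolveChoice specs; the frame dyf and the column name price are made up:

from awsglue.transforms import ResolveChoice

# 1. Cast the ambiguous column to int (values that cannot be cast become null)
resolved_int = ResolveChoice.apply(frame = dyf, specs = [("price", "cast:int")])
# 2. Cast it to string instead
resolved_str = ResolveChoice.apply(frame = dyf, specs = [("price", "cast:string")])
# 3. Keep both types by splitting into two columns (e.g. price_int and price_string)
resolved_both = ResolveChoice.apply(frame = dyf, specs = [("price", "make_cols")])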
- You can't write a Glue DynamicFrame/DataFrame to CSV with a specific file name, because under the hood Spark writes it out with randomly named partition files.
- ResolveChoice is useful when your DynamicFrame has a column whose records have different data types. Unlike a Spark DataFrame, a Glue DynamicFrame does not fall back to string as the default type; it retains both types. Using ResolveChoice you can choose which type the column should have, and records with the other type are set to null.
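That said, if a single output file with a Spark-generated name is acceptable, a common workaround (an untested sketch, with a placeholder bucket path) is to coalesce to one partition before writing:

# Write one part file; the name will still be random (part-00000-...), not a chosen file name
single_df = resolvechoice3.toDF().coalesce(1)
single_df.write.mode("overwrite").option("header", "true").csv("s3://my-bucket/order_summary_csv/")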

AWS push down predicate not working when reading HIVE partitions

I'm trying to test out some Glue functionality, and the push down predicate is not working on Avro files within S3 that were partitioned for use in Hive. Our partitions are as follows: YYYY-MM-DD.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

filterpred = "loaddate == '2019-08-08'"
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "hive",
    table_name = "stuff",
    pushDownPredicate = filterpred)
print('############################################')
print("COUNT: ", datasource0.count())
print('##############################################')
df = datasource0.toDF()
df.show(5)
job.commit()
However, I still see Glue pulling in dates way outside of the range:
Opening 's3://data/2018-11-29/part-00000-a58ee9cb-c82c-46e6-9657-85b4ead2927d-c000.avro' for reading
2019-09-13 13:47:47,071 INFO [Executor task launch worker for task 258] s3n.S3NativeFileSystem (S3NativeFileSystem.java:open(1208)) -
Opening 's3://data/2017-09-28/part-00000-53c07db9-05d7-4032-aa73-01e239f509cf.avro' for reading
I tried using the examples in the following:
AWS Glue DynamicFrames and Push Down Predicate
AWS Glue pushdown predicate not working properly
Currently none of the proposed solutions work for me. I tried adding the partition column (loaddate), taking it out, quoting, unquoting, etc. It still pulls data outside the date range.
The parameter name in your code is incorrect. The correct parameter to pass to the from_catalog function is "push_down_predicate", not "pushDownPredicate".
Sample snippet :
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database = "hive",
    table_name = "stuff",
    push_down_predicate = filterpred)
Reference AWS Documentation : https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html
It seems your partitions are not in the Hive naming style (key=value), so you have to use the default partition column name partition_0 in the query. Also, as suggested in the other answer, the parameter is called push_down_predicate:
filterpred = "partition_0 == '2019-08-08'"
datasource0 = glue_context.create_dynamic_frame.from_catalog(
database = "hive",
table_name = "stuff",
push_down_predicate = filterpred)
Make sure your data is partitioned properly and run a Glue crawler to create the partitioned table.
Run this query in Athena to repair the table:
MSCK REPAIR TABLE tbl;
Run this query in Athena to check the partitions:
SHOW PARTITIONS tbl;
In Scala you can use the following code:
Without predicate:
val datasource0 = glueContext.getCatalogSource(database = "ny_taxi_db", tableName = "taxi_tbl", redshiftTmpDir = "", transformationContext = "datasource0").getDynamicFrame()
datasource0.toDF().count()
With predicate:
val predicate = "(year == '2016' and year_month == '201601' and year_month_day == '20160114')"
val datasource1 = glueContext.getCatalogSource(database = "ny_taxi_db", tableName = "taxi_tbl", transformationContext = "datasource1", pushDownPredicate = predicate).getDynamicFrame()
datasource1.toDF().count()
In Python you can use the following code:
Without predicate:
ds = glueContext.create_dynamic_frame.from_catalog(database = "ny_taxi_db", table_name = "taxi_data_by_vender", transformation_ctx = "datasource0")
ds.toDF().count()
With predicate:
ds1 = glueContext.create_dynamic_frame.from_catalog(database = "ny_taxi_db" , table_name = "taxi_data_by_vender", transformation_ctx = "datasource1" , push_down_predicate = "(vendorid == 1)")
ds1.toDF().count()

How do I query a JDBC database within AWS Glue using a WHERE clause with PySpark?

I have a self-authored Glue script and a JDBC connection stored in the Glue catalog. I cannot figure out how to use PySpark to run a select statement against the MySQL database in RDS that my JDBC connection points to. I have also used a Glue crawler to infer the schema of the RDS table that I am interested in querying. How do I query the RDS database with a WHERE clause?
I have looked through the documentation for DynamicFrameReader and the GlueContext Class but neither seem to point me in the direction that I am seeking.
It depends on what you want to do. For example, if you want to do a select * from table where <conditions>, there are two options:
Assuming you created a crawler and added the source to your AWS Glue job like this:
# Read data from database
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "db", table_name = "students", redshift_tmp_dir = args["TempDir"])
AWS Glue
# Select the needed fields
selectfields1 = SelectFields.apply(frame = datasource0, paths = ["user_id", "full_name", "is_active", "org_id", "org_name", "institution_id", "department_id"], transformation_ctx = "selectfields1")
# Keep only rows whose org_id is in org_ids (a list assumed to be defined elsewhere)
filter2 = Filter.apply(frame = selectfields1, f = lambda x: x["org_id"] in org_ids, transformation_ctx = "filter2")
PySpark + AWS Glue
# Convert the DynamicFrame to a Spark DataFrame
dataframe = DynamicFrame.toDF(datasource0)
# Create a temporary view
dataframe.createOrReplaceTempView("students")
# Use Spark SQL to select the fields and apply the WHERE clause
# (org_ids is assumed to be a comma-separated string here)
dataframe_sql_df_dim = spark.sql("SELECT user_id, full_name, is_active, org_id, org_name, institution_id, department_id FROM students WHERE org_id IN (" + org_ids + ")")
# Convert back to a DynamicFrame
selectfields = DynamicFrame.fromDF(dataframe_sql_df_dim, glueContext, "selectfields2")
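If the goal is to push the WHERE clause down to MySQL itself instead of filtering after loading, another option (an untested sketch; the JDBC URL, credentials, and table are placeholders rather than the catalog connection) is to read through Spark's JDBC source with a subquery as the table:

# The subquery runs on the MySQL side, so only matching rows are transferred
query = "(SELECT user_id, full_name, org_id FROM students WHERE org_id IN (1, 2, 3)) AS t"
jdbc_df = spark.read.format("jdbc") \
    .option("url", "jdbc:mysql://my-rds-endpoint:3306/mydb") \
    .option("dbtable", query) \
    .option("user", "username") \
    .option("password", "password") \
    .load()
filtered = DynamicFrame.fromDF(jdbc_df, glueContext, "filtered")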

When using Relationalize in Glue there is no id in root table

I have a DynamicFrame in Glue and I am using the Relationalize method, which creates 3 new dynamic frames: root_table, root_table_1 and root_table_2.
When I print the schemas of the tables, or after I insert the tables into the database, I notice that the id is missing from root_table, so I cannot join root_table with the other tables.
I tried all the possible combinations.
Is there something I am missing?
datasource1 = Relationalize.apply(frame = renameId, name = "root_ds", transformation_ctx = "datasource1")
print(datasource1.keys())
print(datasource1.values())
for df_name in datasource1.keys():
    m_df = datasource1.select(df_name)
    print("Writing to Redshift table: ", df_name)
    m_df.printSchema()
    glueContext.write_dynamic_frame.from_jdbc_conf(frame = m_df, catalog_connection = "Redshift", connection_options = {"database" : "redshift", "dbtable" : df_name}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "df_to_db")
I used the code below (leaving out the import statements) on your data and wrote the output to S3. I got the two files pasted after the code. I am reading from the Glue catalog after running a crawler on your data.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "json_aws_glue_relationalize_stackoverflow", transformation_ctx = "datasource0")
dfc = datasource0.relationalize("advertise_root", "s3://aws-glue-temporary-009551040880-ap-southeast-2/")
for df_name in dfc.keys():
m_df = dfc.select(df_name)
print "Writing to S3 file: ", df_name
datasink2 = glueContext.write_dynamic_frame.from_options(frame = m_df, connection_type = "s3", connection_options = {"path": "s3://aws-glue-relationalize-stackoverflow/" + df_name +"/"}, format = "csv", transformation_ctx = "datasink2")
job.commit()
main table
advertiserCountry,advertiserId,amendReason,amended,clickDate,clickDevice,clickRefs.clickRef2,clickRefs.clickRef6,commissionAmount.amount,"commissionAmount.currency","commissionSharingPublisherId",commissionStatus,customParameters,customerCountry,declineReason,id,ipHash,lapseTime,oldCommissionAmount,oldSaleAmount,orderRef,originalSaleAmount,paidToPublisher,paymentId,publisherId,publisherUrl,saleAmount.amount,saleAmount.currency,siteName,transactionDate,transactionDevice,transactionParts,transactionQueryId,type,url,validationDate,voucherCode,voucherCodeUsed,partition_0
AT,123456,,false,2018-09-05T16:31:00,iPhone,"asdsdedrfrgthyjukiloujhrdf45654565423212",www.website.at,1.5,EUR,,pending,,AT,,321547896,-27670654789123380,68,,,,,false,0,654987,,1.0,EUR,https://www.site.at,2018-09-05T16:32:00,iPhone,1,0,Lead,https://www.website.at,,,false,advertise
Another table for transaction parts
id,index,"transactionParts.val.amount","transactionParts.val.commissionAmount","transactionParts.val.commissionGroupCode","transactionParts.val.commissionGroupId","transactionParts.val.commissionGroupName"
1,0,1.0,1.5,LEAD,654654,Lead
Glue generated a primary key column named "transactionParts" in the base table, and the id in the transactionparts table is the foreign key to that column. As you can see, it preserved the original id column as it is.
Can you please try the code on your data and see if it works (changing the source table name to yours)? Try writing to S3 as CSV first, to figure out whether that works. Please let me know your findings.
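To illustrate that key/foreign-key relationship, a rough, untested sketch of joining the root table back to the child table (the frame names below are assumptions about what Relationalize produced for this data):

root_df = dfc.select("advertise_root").toDF()
parts_df = dfc.select("advertise_root_transactionParts").toDF()
# the generated "transactionParts" column in the root table references
# the "id" column of the child table produced for that array
joined = root_df.join(parts_df, root_df["transactionParts"] == parts_df["id"])
joined.show(5)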

AWS Glue: ETL job creates many empty output files

I'm very new to this, so I'm not sure whether this script could be simplified or whether I'm doing something wrong that's causing this behavior. I've written an ETL script for AWS Glue that writes to a directory within an S3 bucket.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# catalog: database and table names
db_name = "events"
tbl_base_event_info = "base_event_info"
tbl_event_details = "event_details"
# output directories
output_dir = "s3://whatever/output"
# create dynamic frames from source tables
base_event_source = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = tbl_base_event_info)
event_details_source = glueContext.create_dynamic_frame.from_catalog(database = db_name, table_name = tbl_event_details)
# join frames
base_event_source_df = base_event_source.toDF()
event_details_source_df = event_details_source.toDF()
enriched_event_df = base_event_source_df.join(event_details_source_df, "event_id")
enriched_event = DynamicFrame.fromDF(enriched_event_df, glueContext, "enriched_event")
# write frame to json files
datasink = glueContext.write_dynamic_frame.from_options(frame = enriched_event, connection_type = "s3", connection_options = {"path": output_dir}, format = "json")
job.commit()
The base_event_info table has 4 columns: event_id, event_name, platform, client_info
The event_details table has 2 columns: event_id, event_details
The joined table schema should look like: event_id, event_name, platform, client_info, event_details
After I run this job, I expected to get 2 JSON files, since that's how many records are in the resulting joined table (there are two records in the tables with the same event_id). However, what I get is about 200 files in the form of run-1540321737719-part-r-00000, run-1540321737719-part-r-00001, etc.:
198 files contain 0 bytes
2 files contain 250 bytes (each with the correct info corresponding to the enriched events)
Is this the expected behavior? Why is this job generating so many empty files? Is there something wrong with my script?
The Spark SQL module has the following default configuration: spark.sql.shuffle.partitions is set to 200.
That is why you are getting 200 files in the first place.
You can check whether this is the case with:
enriched_event_df.rdd.getNumPartitions()
If you get a value of 200, you can change it to the number of files you want to generate with the following code (repartition returns a new DataFrame, so assign the result):
enriched_event_df = enriched_event_df.repartition(2)
The above code will produce only two files with your data.
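Putting that together with the original script, a minimal sketch (untested, reusing the question's variable names) repartitions before converting back to a DynamicFrame and writing:

# reduce the number of output partitions before the write
enriched_event_df = enriched_event_df.repartition(2)
enriched_event = DynamicFrame.fromDF(enriched_event_df, glueContext, "enriched_event")
datasink = glueContext.write_dynamic_frame.from_options(
    frame = enriched_event,
    connection_type = "s3",
    connection_options = {"path": output_dir},
    format = "json")

Using coalesce(2) instead of repartition(2) avoids a full shuffle when you are only reducing the partition count.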
In my experience, empty output files point to an error in transformations.
You can debug these using the error functions.
By the way, why are you doing the joins with Spark DataFrames instead of DynamicFrames?
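For reference, a DynamicFrame exposes error-inspection helpers such as errorsCount() and errorsAsDynamicFrame(); a small, untested sketch using the question's enriched_event frame:

# count records that failed during DynamicFrame transformations
print(enriched_event.errorsCount())
# inspect the failed records themselves
errors = enriched_event.errorsAsDynamicFrame()
errors.toDF().show(10, truncate=False)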
Instead of repartitioning, you can add a column such as a timestamp to the DataFrame through a Spark SQL transformation step and use it as a partition key when writing the DataFrame to S3.
For example:
select replace(replace(replace(string(date_trunc('HOUR', current_timestamp())), '-', ''), ':', ''), ' ', '') as datasetdate, * from myDataSource;
Use datasetdate as the partition key while writing the DynamicFrame; the Glue job should be able to add the partitions automatically.
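A rough sketch of that write (untested; it assumes the derived datasetdate column is already on the DataFrame from the SQL step above):

partitioned_dyf = DynamicFrame.fromDF(enriched_event_df, glueContext, "partitioned_dyf")
glueContext.write_dynamic_frame.from_options(
    frame = partitioned_dyf,
    connection_type = "s3",
    connection_options = {"path": output_dir, "partitionKeys": ["datasetdate"]},
    format = "json")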