My Glue job reads a table (a CSV file on S3), repartitions it and writes 10 JSON files to S3.
I noticed that in the resulting files, some columns are missing from certain lines!
This is the code:
etalab_named_postgre_csv = glueContext.create_dynamic_frame.from_catalog(database = "db", table_name = "tab", transformation_ctx = "datasource0")
applymapping_etalab_named_postgre_csv= ApplyMapping.apply(frame = etalab_named_postgre_csv, mappings = [("compldistrib", "string", "compldistrib", "string"), ("numvoie", "long", "numvoie", "long"),....], transformation_ctx = "applymapping1")
path_s3 = "s3://Bucket"
etalab_named_postgre_csv = applymapping_etalab_named_postgre_csv.toDF()
etalab_named_postgre_csv.repartition(10).write.format("json").mode("overwrite").save(path_s3)
In the output files some of the columns simply disappear!
I used Spark on EMR to load the same input table and confirmed that the missing columns do exist in the source.
Is this common Glue behaviour? How can I prevent it, please?
EDIT:
I am now sure of the problem.
The Glue mapping seems to be the source. When I do
applymapping_etalab_named_postgre_csv= ApplyMapping.apply(frame = etalab_named_postgre_csv, mappings = [("compldistrib", "string", "compldistrib", "string"), ("numvoie", "long", "numvoie", "long"),....], transformation_ctx = "applymapping1")
I declare that compldistrib is a string and I want it as a string in the output. If a row contains a numeric value in compldistrib, the mapping simply drops it!
Is this a bug?
So after hours of searching I didn't find a solution. The alternative I found is replacing the Glue job with a Spark job on EMR, which is also a lot quicker.
I hope this will help someone.
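For anyone hitting the same thing: if compldistrib really contains a mix of longs and strings, Glue represents it as a "choice" type, and resolving that choice to a single type before ApplyMapping might avoid the dropped values. An untested sketch, reusing the frame from the job above:
from awsglue.transforms import ApplyMapping, ResolveChoice
# Force the ambiguous column to string before mapping; "cast" is a standard ResolveChoice specs action
resolved = ResolveChoice.apply(frame = etalab_named_postgre_csv, specs = [("compldistrib", "cast:string")], transformation_ctx = "resolvechoice0")
applymapping_etalab_named_postgre_csv = ApplyMapping.apply(frame = resolved, mappings = [("compldistrib", "string", "compldistrib", "string"), ("numvoie", "long", "numvoie", "long")], transformation_ctx = "applymapping1")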
Related
Does anyone know whether it's possible to tell the Glue writer to keep the column you're partitioning on in the actual dataframe?
https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/
Here, $outpath is a placeholder for the base output path in S3. The partitionKeys parameter can also be specified in Python in the connection_options dict:
glue_context.write_dynamic_frame.from_options(
    frame = projectedEvents,
    connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
    format = "parquet")
When you execute this write, the type field is removed from the individual records and is encoded in the directory structure.
I would like to keep the type field in the individual record.
I am not 100% sure whether it's possible to tell Glue to keep the column, but in the meantime you could use this workaround:
from awsglue.dynamicframe import DynamicFrame
# withColumn is a Spark DataFrame method, so convert, duplicate the partition column, then convert back
projected_df = projectedEvents.toDF()
projected_df = projected_df.withColumn("type_partition", projected_df["type"])
projectedEvents = DynamicFrame.fromDF(projected_df, glue_context, "projectedEvents")
glue_context.write_dynamic_frame.from_options(
    frame=projectedEvents,
    connection_options={"path": "$outpath", "partitionKeys": ["type_partition"]},
    format="parquet"
)
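With this workaround the original type column stays in every record, while the duplicated type_partition column is the one consumed by the writer and encoded in the directory structure.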
I was able to create a small Glue job to ingest data from one S3 bucket into another, but I am not clear about the last few lines of the code below.
applymapping1 = ApplyMapping.apply(frame = datasource_lk, mappings = [("row_id", "bigint", "row_id", "bigint"), ("Quantity", "long", "Quantity", "long"),("Category", "string", "Category", "string") ], transformation_ctx = "applymapping1")
selectfields2 = SelectFields.apply(frame = applymapping1, paths = ["row_id", "Quantity", "Category"], transformation_ctx = "selectfields2")
resolvechoice3 = ResolveChoice.apply(frame = selectfields2, choice = "MATCH_CATALOG", database = "mydb", table_name = "order_summary_csv", transformation_ctx = "resolvechoice3")
datasink4 = glueContext.write_dynamic_frame.from_catalog(frame = resolvechoice3, database = "mydb", table_name = "order_summary_csv", transformation_ctx = "datasink4")
job.commit()
From the above code snippet, what is the use of ResolveChoice? Is it mandatory?
When I ran this job, it created a new folder and a file (with a random file name) in the destination (order_summary_csv) and ingested the data there, instead of ingesting it directly into my order_summary_csv table (a CSV file) residing in the S3 folder. Is it possible for Spark (Glue) to ingest data into a specific CSV file?
I think this ResolveChoice.apply call may be out of date, since I can't see a choice like "MATCH_CATALOG" in the doc:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-ResolveChoice.html
The general idea behind ResolveChoice is that if you have a field containing both int values and string values, you have to decide how to handle that field (each option maps to a specs action, sketched after this list):
Cast it to int
Cast it to string
Keep both and create two columns in the result dataset
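For reference, a minimal sketch of what those three choices look like with the specs form of ResolveChoice (frame and column names borrowed from the question above):
from awsglue.transforms import ResolveChoice
# Option 1: cast the ambiguous column to int (values that cannot be cast become null)
cast_int = ResolveChoice.apply(frame = applymapping1, specs = [("Quantity", "cast:int")], transformation_ctx = "rc_int")
# Option 2: cast it to string
cast_str = ResolveChoice.apply(frame = applymapping1, specs = [("Quantity", "cast:string")], transformation_ctx = "rc_str")
# Option 3: keep both types by splitting into two columns (Quantity_int and Quantity_string)
both_cols = ResolveChoice.apply(frame = applymapping1, specs = [("Quantity", "make_cols")], transformation_ctx = "rc_cols")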
-You can't write a Glue DynamicFrame/DataFrame to CSV with a specific file name, because in the backend Spark writes it out with a random part-file name (a possible workaround is sketched after this list).
-ResolveChoice is useful when your dynamic frame has a column whose records carry different data types. Unlike a Spark DataFrame, a Glue DynamicFrame doesn't fall back to string as the default data type; it retains both types. Using resolveChoice you can choose which data type the column should ideally have, and records holding the other data type are set to null (when casting).
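If a single, predictably named CSV really is required, one workaround outside the Glue writer is to coalesce to one part file and rename it with boto3 afterwards. A sketch only; the bucket and prefix names are placeholders:
import boto3
# Write a single part file to a temporary prefix (placeholder names)
resolvechoice3.toDF().coalesce(1).write.mode("overwrite").option("header", "true").csv("s3://my-bucket/tmp_order_summary/")
s3 = boto3.client("s3")
objs = s3.list_objects_v2(Bucket = "my-bucket", Prefix = "tmp_order_summary/")["Contents"]
part_key = next(o["Key"] for o in objs if o["Key"].endswith(".csv"))
# Copy the part file to the desired key, then delete the randomly named original
s3.copy_object(Bucket = "my-bucket", CopySource = {"Bucket": "my-bucket", "Key": part_key}, Key = "order_summary.csv")
s3.delete_object(Bucket = "my-bucket", Key = part_key)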
I'm currently automating my data lake ingestion process. Data lands in my raw zone (an S3 bucket) containing 27 folders, each corresponding to a database; each folder holds some number of CSV files, each corresponding to a table. An S3 event (All object create events) triggers a Lambda function that crawls my raw zone, and I can see every table successfully. On completion I'd like an ETL job that moves the data into the processed zone, converting it to parquet, but given the number of tables I don't want to manually create a job specifying each table as a source.
I demoed my automation by uploading a single CSV file to the raw zone: the crawler ran, then the ETL job converted the raw-zone table to parquet and landed it in the processed zone. When I dropped in my second table, the crawler successfully recognized it as a new table in the raw zone, but in the processed zone its data was merged into the first table's schema (even though the two schemas are completely different).
I would expect the following:
1) crawler to recognize the csv as a table
2) glue etl to convert the file to parquet
3) crawler to recognize parquet file(s) as a single table
The following code highlights the problem I was facing - the datasource that was specified is a table (folder) and everything within that folder was assumed to have the same schema.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "APPLICATION_XYZ", table_name = "RAW_ZONE_w1cqzldd5jpe", transformation_ctx = "datasource0")
## #type: ApplyMapping
## #args: [mapping = [("vendorid", "long", "vendorid", "long"), ("lpep_pickup_datetime", "string", "lpep_pickup_datetime", "string"), ("lpep_dropoff_datetime", "string", "lpep_dropoff_datetime", "string"), ("store_and_fwd_flag", "string", "store_and_fwd_flag", "string"), ("ratecodeid", "long", "ratecodeid", "long"), ("pulocationid", "long", "pulocationid", "long"), ("dolocationid", "long", "dolocationid", "long"), ("passenger_count", "long", "passenger_count", "long"), ("trip_distance", "double", "trip_distance", "double"), ("fare_amount", "double", "fare_amount", "double"), ("extra", "double", "extra", "double"), ("mta_tax", "double", "mta_tax", "double"), ("tip_amount", "double", "tip_amount", "double"), ("tolls_amount", "double", "tolls_amount", "double"), ("ehail_fee", "string", "ehail_fee", "string"), ("improvement_surcharge", "double", "improvement_surcharge", "double"), ("total_amount", "double", "total_amount", "double"), ("payment_type", "long", "payment_type", "long"), ("trip_type", "long", "trip_type", "long")], transformation_ctx = "applymapping1"]
## #return: applymapping1
## #inputs: [frame = datasource0]
I created an ETL job with the following loop over the tables in my database, writing a parquet file for each into a new folder with the same name (so I can crawl the tables and query them with Athena).
import boto3

# Data Catalog client used to list the tables in the raw-zone database
client = boto3.client("glue")

databaseName = "DATABASE"
tables = client.get_tables(DatabaseName = databaseName)
tableList = tables["TableList"]
for table in tableList:
    tableName = table["Name"]
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database = databaseName, table_name = tableName, transformation_ctx = "datasource0")
    # Each table gets its own output folder so the processed-zone crawler sees it as a separate table
    datasink4 = glueContext.write_dynamic_frame.from_options(frame = datasource0, connection_type = "s3", connection_options = {"path": "s3://processed-45ah4xoyqr1b/Application1/" + tableName + "/"}, format = "parquet", transformation_ctx = "datasink4")
job.commit()
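To automate step 3 as well (having a crawler pick up each parquet folder as its own table), one option is to kick off a second crawler at the end of the job. A sketch, assuming a crawler named "processed-zone-crawler" has already been created and scoped to the processed bucket:
# Hypothetical crawler name; it must already exist and point at the processed zone
client.start_crawler(Name = "processed-zone-crawler")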
I have a DynamicFrame in Glue and I am using the Relationalize method, which creates three new dynamic frames: root_table, root_table_1 and root_table_2.
When I print the schema of the tables, or after I insert them into the database, I notice that the id is missing from root_table, so I cannot join root_table to the other tables.
I tried all the possible combinations.
Is there something I am missing?
datasource1 = Relationalize.apply(frame = renameId, name = "root_ds", transformation_ctx = "datasource1")
print(datasource1.keys())
print(datasource1.values())
for df_name in datasource1.keys():
    m_df = datasource1.select(df_name)
    print("Writing to Redshift table: ", df_name)
    m_df.printSchema()
    glueContext.write_dynamic_frame.from_jdbc_conf(frame = m_df, catalog_connection = "Redshift", connection_options = {"database" : "redshift", "dbtable" : df_name}, redshift_tmp_dir = args["TempDir"], transformation_ctx = "df_to_db")
I used the code below (with the import bits removed) on your data and wrote the output to S3. I got the two files pasted after the code. I am reading from the Glue catalog after running the crawler on your data.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "sampledb", table_name = "json_aws_glue_relationalize_stackoverflow", transformation_ctx = "datasource0")
dfc = datasource0.relationalize("advertise_root", "s3://aws-glue-temporary-009551040880-ap-southeast-2/")
for df_name in dfc.keys():
    m_df = dfc.select(df_name)
    print("Writing to S3 file: ", df_name)
    datasink2 = glueContext.write_dynamic_frame.from_options(frame = m_df, connection_type = "s3", connection_options = {"path": "s3://aws-glue-relationalize-stackoverflow/" + df_name + "/"}, format = "csv", transformation_ctx = "datasink2")
job.commit()
main table
advertiserCountry,advertiserId,amendReason,amended,clickDate,clickDevice,clickRefs.clickRef2,clickRefs.clickRef6,commissionAmount.amount,"commissionAmount.currency","commissionSharingPublisherId",commissionStatus,customParameters,customerCountry,declineReason,id,ipHash,lapseTime,oldCommissionAmount,oldSaleAmount,orderRef,originalSaleAmount,paidToPublisher,paymentId,publisherId,publisherUrl,saleAmount.amount,saleAmount.currency,siteName,transactionDate,transactionDevice,transactionParts,transactionQueryId,type,url,validationDate,voucherCode,voucherCodeUsed,partition_0
AT,123456,,false,2018-09-05T16:31:00,iPhone,"asdsdedrfrgthyjukiloujhrdf45654565423212",www.website.at,1.5,EUR,,pending,,AT,,321547896,-27670654789123380,68,,,,,false,0,654987,,1.0,EUR,https://www.site.at,2018-09-05T16:32:00,iPhone,1,0,Lead,https://www.website.at,,,false,advertise
Another table for transaction parts
id,index,"transactionParts.val.amount","transactionParts.val.commissionAmount","transactionParts.val.commissionGroupCode","transactionParts.val.commissionGroupId","transactionParts.val.commissionGroupName"
1,0,1.0,1.5,LEAD,654654,Lead
Glue generated a primary-key column named "transactionParts" in the base table, and the id in the transactionParts table is the foreign key to that column. As you can see, it preserved the original id column as-is.
Can you please try the code on your data (changing the source table name to yours) and see if it works? Try writing to S3 as CSV first to figure out whether that part works. Please let me know your findings.
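If the goal is then to join the relationalized frames back together, here is a sketch; the child frame key name is assumed from Relationalize's usual "<name>_<path>" naming and the sample output above:
# Root table and transaction-parts table as Spark DataFrames
root_df = dfc.select("advertise_root").toDF()
parts_df = dfc.select("advertise_root_transactionParts").toDF()
# The generated "transactionParts" key in the root table matches "id" in the child table
joined = root_df.join(parts_df, root_df["transactionParts"] == parts_df["id"], "left")
joined.show(5)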
I am able to write in parquet format, partitioned by a column, like so:
jobname = args['JOB_NAME']
#header is a spark DataFrame
header.repartition(1).write.parquet('s3://bucket/aws-glue/{}/header/'.format(jobname), 'append', partitionBy='date')
But I am not able to do this with Glue's DynamicFrame.
header_tmp = DynamicFrame.fromDF(header, glueContext, "header")
glueContext.write_dynamic_frame.from_options(frame = header_tmp, connection_type = "s3", connection_options = {"path": 's3://bucket/output/header/'}, format = "parquet")
I have tried passing partitionBy as part of the connection_options dict, since the AWS docs say Glue does not support any format options for parquet, but that didn't work.
Is this possible, and how? As for my reason for doing it this way: I thought it was needed for job bookmarking to work, which is currently not working for me.
From AWS Support (paraphrasing a bit):
As of today, Glue does not support the partitionBy parameter when writing to parquet. This is in the pipeline to be worked on, though.
Using the Glue API to write to parquet is required for the job bookmarking feature to work with S3 sources.
So as of today it is not possible to partition parquet files AND enable the job bookmarking feature.
Edit: today (3/23/18) I found this in the documentation:
glue_context.write_dynamic_frame.from_options(
    frame = projectedEvents,
    connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
    format = "parquet")
That option may have always been there and both myself and the AWS support person missed it, or it was only added recently. Either way, it seems like it is possible now.
I use some of the columns from my dataframe as the partitionKeys option:
glueContext.write_dynamic_frame \
    .from_options(
        frame = some_dynamic_dataframe,
        connection_type = "s3",
        connection_options = {"path": "some_path", "partitionKeys": ["month", "day"]},
        format = "parquet")