AWS Glue write parquet with partitions

I am able to write to parquet format and partitioned by a column like so:
jobname = args['JOB_NAME']
#header is a spark DataFrame
header.repartition(1).write.parquet('s3://bucket/aws-glue/{}/header/'.format(jobname), 'append', partitionBy='date')
But I am not able to do this with Glue's DynamicFrame.
header_tmp = DynamicFrame.fromDF(header, glueContext, "header")
glueContext.write_dynamic_frame.from_options(frame = header_tmp, connection_type = "s3", connection_options = {"path": 's3://bucket/output/header/'}, format = "parquet")
I have tried passing partitionBy as part of the connection_options dict, since the AWS docs say that for parquet Glue does not support any format options, but that didn't work.
Is this possible, and how? As for reasons for doing it this way, I thought it was needed for job bookmarking to work, as that is not working for me currently.

From AWS Support (paraphrasing a bit):
As of today, Glue does not support the partitionBy parameter when writing to parquet. This is in the pipeline to be worked on though.
Using the Glue API to write to parquet is required for the job bookmarking feature to work with S3 sources.
So as of today it is not possible to partition parquet files AND enable the job bookmarking feature.
Edit: today (3/23/18) I found this in the documentation:
glue_context.write_dynamic_frame.from_options(
    frame = projectedEvents,
    connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
    format = "parquet")
That option may have always been there and both I and the AWS support person missed it, or it may only have been added recently. Either way, it seems it is possible now.

I use some of the columns from my dataframe as the partitionKeys values:
glueContext.write_dynamic_frame \
    .from_options(
        frame = some_dynamic_dataframe,
        connection_type = "s3",
        connection_options = {"path": "some_path", "partitionKeys": ["month", "day"]},
        format = "parquet")
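With those keys, the writer encodes the partition values into the S3 key layout (Hive-style key=value directories), so the output lands under paths roughly like the following (the path and file names are illustrative, not taken from the job above):
some_path/month=01/day=15/part-00000-<uuid>.snappy.parquet
some_path/month=01/day=16/part-00000-<uuid>.snappy.parquet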

Related

AWS Glue: Keep partitioned column as value in row after writing

Does anyone know whether it's possible to tell the Glue writer to keep the column you're partitioning on in the actual dataframe?
https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/
Here, $outpath is a placeholder for the base output path in S3. The
partitionKeys parameter can also be specified in Python in the
connection_options dict:
glue_context.write_dynamic_frame.from_options(
    frame = projectedEvents,
    connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
    format = "parquet")
When you execute this write, the type field is removed from the
individual records and is encoded in the directory structure.
I would like to keep the type field in the individual record.
I am not 100% sure whether it is possible to tell Glue to keep the column, but in the meantime you could use this workaround:
projectedEvents = projectedEvents.withColumn("type_partition", projectedEvents["type"])

glue_context.write_dynamic_frame.from_options(
    frame=projectedEvents,
    connection_options={"path": "$outpath", "partitionKeys": ["type_partition"]},
    format="parquet"
)
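Note that this is general Spark behaviour rather than anything Glue-specific: if you later read the partitioned output back from the base path, the partition column is reconstructed from the directory names, so with this workaround you end up with both type and type_partition in the result. A minimal read-back sketch ($outpath is the same placeholder as above, and spark is the usual session available in a Glue job):
# Reading the base path restores "type_partition" from the partition directories
df_back = spark.read.parquet("$outpath")
df_back.printSchema()  # shows both "type" and "type_partition"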

Use of ResolveChoice in Glue

I was able to create a small Glue job to ingest data from one S3 bucket into another, but I am not clear about the last few lines of the code (below).
applymapping1 = ApplyMapping.apply(frame = datasource_lk, mappings = [("row_id", "bigint", "row_id", "bigint"), ("Quantity", "long", "Quantity", "long"),("Category", "string", "Category", "string") ], transformation_ctx = "applymapping1")
selectfields2 = SelectFields.apply(frame = applymapping1, paths = ["row_id", "Quantity", "Category"], transformation_ctx = "selectfields2")
resolvechoice3 = ResolveChoice.apply(frame = selectfields2, choice = "MATCH_CATALOG", database = "mydb", table_name = "order_summary_csv", transformation_ctx = "resolvechoice3")
datasink4 = glueContext.write_dynamic_frame.from_catalog(frame = resolvechoice3, database = "mydb", table_name = "order_summary_csv", transformation_ctx = "datasink4")
job.commit()
From the above code snippet, what is the use of 'ResolveChoice'? Is it mandatory?
When I ran this job, it created a new folder and file (with a random file name) in the destination (order_summary.csv) and ingested the data there, instead of ingesting it directly into my order_summary_csv table (a CSV file) residing in the S3 folder. Is it possible for Spark (Glue) to ingest data into a specific CSV file?
I think this ResolveChoice.apply method call is out of date, since the documentation does not list a choice like "MATCH_CATALOG":
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-ResolveChoice.html
The general idea behind ResolveChoice is that if you have a field with both int values and string values, you need to resolve how that field should be handled (a sketch follows the list):
Cast it to Int
Cast it to String
Leave both and create 2 columns in the result dataset
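A hedged sketch of what those three options look like with the specs parameter (my_frame is a placeholder DynamicFrame; "Quantity" is just borrowed from the question's mapping for illustration):
# Cast the mixed-type column to int
resolved_int = ResolveChoice.apply(frame = my_frame, specs = [("Quantity", "cast:int")])
# Cast the mixed-type column to string
resolved_str = ResolveChoice.apply(frame = my_frame, specs = [("Quantity", "cast:string")])
# Keep both types as separate columns (named roughly Quantity_int and Quantity_string)
resolved_both = ResolveChoice.apply(frame = my_frame, specs = [("Quantity", "make_cols")])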
- You can't write a Glue DynamicFrame/DataFrame to CSV format with a specific file name, because in the backend Spark writes it out with a random part-file name.
- ResolveChoice is useful when your dynamic frame has a column whose records have different datatypes. Unlike a Spark DataFrame, a Glue DynamicFrame doesn't fall back to string as the default datatype; it retains both datatypes. Using ResolveChoice, you can choose which datatype the column should ideally have, and records with the other datatype are set to null.

Glue job is deleting columns when creating multiple partitions

My Glue job reads a table (an S3 CSV file), then partitions it and writes 10 JSON files to S3.
I noticed that in the resulting files, some columns are gone from some of the lines!
This is the code:
etalab_named_postgre_csv = glueContext.create_dynamic_frame.from_catalog(database = "db", table_name = "tab", transformation_ctx = "datasource0")
applymapping_etalab_named_postgre_csv= ApplyMapping.apply(frame = etalab_named_postgre_csv, mappings = [("compldistrib", "string", "compldistrib", "string"), ("numvoie", "long", "numvoie", "long"),....], transformation_ctx = "applymapping1")
path_s3 = "s3://Bucket"
etalab_named_postgre_csv = applymapping_etalab_named_postgre_csv.toDF()
etalab_named_postgre_csv.repartition(10).write.format("json").option("sep",",").option("header", "true").option("mode","Overwrite").save(path_s3)
In the output files some of the columns just disappear!
I used Spark on EMR to load the same input table and check that the columns that disappeared do exist in the source.
Is this common Glue behaviour? How can I prevent it, please?
EDIT:
I am now sure of the problem.
It seems that the Glue mapping is the source of the problem. When I do
applymapping_etalab_named_postgre_csv= ApplyMapping.apply(frame = etalab_named_postgre_csv, mappings = [("compldistrib", "string", "compldistrib", "string"), ("numvoie", "long", "numvoie", "long"),....], transformation_ctx = "applymapping1")
I declare that compldistrib is a String and I want it as a String in the output. If a row contains a numeric value in compldistrib, the mapping will just ignore it!
Is this a bug?
So after hours of searching I didn't find a solution. The alternative I found is replacing the Glue job with a Spark job on EMR. It is also a lot quicker.
I hope this will help someone.
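A possible workaround, based on the ResolveChoice behaviour described earlier in this thread: resolve the mixed-type column before applying the mapping, so the numeric values are cast instead of dropped. This is only a sketch and has not been tested against this data; the column name is the asker's, and the "cast:string" spec is an assumption about what the source data needs.
# Assumes the usual "from awsglue.transforms import *" from the generated Glue script
# Resolve compldistrib to string before ApplyMapping so numeric values survive as strings
resolved = ResolveChoice.apply(
    frame = etalab_named_postgre_csv,
    specs = [("compldistrib", "cast:string")],
    transformation_ctx = "resolvechoice_compldistrib")
applymapping_etalab_named_postgre_csv = ApplyMapping.apply(
    frame = resolved,
    mappings = [("compldistrib", "string", "compldistrib", "string"), ("numvoie", "long", "numvoie", "long"), ...],
    transformation_ctx = "applymapping1")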

How to partition data by datetime in AWS Glue?

The current set-up:
S3 location with json files. All files stored in the same location (no day/month/year structure).
Glue Crawler reads the data in a catalog table
Glue ETL job transforms and stores the data into parquet tables in s3
Glue Crawler reads from s3 parquet tables and stores into a new table that gets queried by Athena
What I want to achieve is (1) the parquet data partitioned by day and (2) all of one day's data written to the same file. Currently there is a separate parquet file for each JSON file.
How would I go about it?
One thing to mention, there is a datetime column in the data, but it's a unix epoch timestamp. I would probably need to convert that to a 'year/month/day' format, otherwise I'm assuming it will create a partition for each file again.
Thanks a lot for your help!!
Convert Glue's DynamicFrame into Spark's DataFrame to add year/month/day columns and repartition. Reducing partitions to one will ensure that only one file will be written into a folder, but it may slow down job performance.
Here is the Python code:
from pyspark.sql.functions import col,year,month,dayofmonth,to_date,from_unixtime
...
df = dynamicFrameSrc.toDF()

repartitioned_with_new_columns_df = df
    .withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
    .withColumn("year", year(col("date_col")))
    .withColumn("month", month(col("date_col")))
    .withColumn("day", dayofmonth(col("date_col")))
    .drop(col("date_col"))
    .repartition(1)

dyf = DynamicFrame.fromDF(repartitioned_with_new_columns_df, glueContext, "enriched")

datasink = glueContext.write_dynamic_frame.from_options(
    frame = dyf,
    connection_type = "s3",
    connection_options = {
        "path": "s3://yourbucket/data",
        "partitionKeys": ["year", "month", "day"]
    },
    format = "parquet",
    transformation_ctx = "datasink"
)
Note that from pyspark.sql.functions import col can give a reference error; this shouldn't be a problem, as explained here.
I cannot comment so I am going to write as an answer.
I used Yuriy's code and a couple of things needed adjustment:
missing brackets
df = dynamicFrameSrc.toDF()
after toDF() I had to add select("*"), otherwise the schema was empty
df.select("*")
    .withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
To achieve this in AWS Glue Studio:
You will need to make a custom function to convert the datetime field to date. There is the extra step of converting it back to a DynamicFrameCollection.
In Python:
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0]).toDF()
    df_with_date = df.withColumn('date_field', df['datetime_field'].cast('date'))
    glue_df = DynamicFrame.fromDF(df_with_date, glueContext, "transform_date")
    return DynamicFrameCollection({"CustomTransform0": glue_df}, glueContext)
You would then have to edit the custom transformer schema to include that new date field you just created.
You can then use the "data target" node to write the data to disk and then select that new date field to use as a partition.
video step by step walkthrough

Is it possible to write a Partitioned DataFrame into S3 bucket?

I have to write a Spark DataFrame to an S3 bucket, and it should create a separate parquet file for each partition.
Here is my code:
dynamicDataFrame = DynamicFrame.fromDF(
    testDataFrame, glueContext,
    "dynamicDataFrame")

glueContext.write_dynamic_frame.from_options(
    frame = dynamicDataFrame,
    connection_type = "s3",
    connection_options = {
        "path": "s3://BUCKET_NAME/DIR_NAME",
        "partitionKeys": ["COL_NAME"]
    },
    format = "parquet"
)
When I specify the "partitionKeys": ["COL_NAME"] option, the Glue job executes without any error but does not create any files in S3.
When I remove the "partitionKeys" option, it creates 200 parquet files in S3 (the default number of partitions is 200).
But I want to create the partitions based on a particular column.
So, is it possible to create partition-wise parquet files in S3 while writing a DataFrame to S3?
Note: I am using AWS resources, i.e. AWS Glue.
Are you sure the partition column has data?
Do you find anything in the Glue logs?
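A quick way to check the first point is to inspect the column on the Spark DataFrame before converting it to a DynamicFrame; a minimal sketch (testDataFrame and COL_NAME are the names from the question):
from pyspark.sql.functions import col

# How many rows have a null vs. non-null partition value?
testDataFrame.groupBy(col("COL_NAME").isNull()).count().show()
# What values would the partition directories be created from?
testDataFrame.select("COL_NAME").distinct().show(10)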