The current set-up:
S3 location with JSON files. All files are stored in the same location (no day/month/year structure).
A Glue Crawler reads the data into a catalog table.
A Glue ETL job transforms the data and stores it as Parquet files in S3.
A Glue Crawler reads the S3 Parquet files and creates a new table that gets queried by Athena.
What I want to achieve is (1) the Parquet data partitioned by day and (2) the Parquet data for one day written to a single file. Currently a separate Parquet file is produced for each JSON file.
How would I go about it?
One thing to mention: there is a datetime column in the data, but it's a Unix epoch timestamp. I would probably need to convert that to a year/month/day format; otherwise I'm assuming it will create a partition for each file again.
Thanks a lot for your help!!
Convert Glue's DynamicFrame into Spark's DataFrame to add year/month/day columns and repartition. Reducing partitions to one will ensure that only one file will be written into a folder, but it may slow down job performance.
Here is the Python code:

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, year, month, dayofmonth, to_date, from_unixtime
...
df = dynamicFrameSrc.toDF()

# derive year/month/day columns from the Unix epoch timestamp, then
# collapse to a single Spark partition so each day folder gets one file
repartitioned_with_new_columns_df = (
    df
    .withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
    .withColumn("year", year(col("date_col")))
    .withColumn("month", month(col("date_col")))
    .withColumn("day", dayofmonth(col("date_col")))
    .drop(col("date_col"))
    .repartition(1)
)

dyf = DynamicFrame.fromDF(repartitioned_with_new_columns_df, glueContext, "enriched")

datasink = glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={
        "path": "s3://yourbucket/data",
        "partitionKeys": ["year", "month", "day"]
    },
    format="parquet",
    transformation_ctx="datasink"
)
Note that from pyspark.sql.functions import col can give a reference error; this shouldn't be a problem, as explained here.
I cannot comment, so I am going to write this as an answer.
I used Yuriy's code and a couple of things needed adjustment:
missing brackets around the chained calls on
df = dynamicFrameSrc.toDF()
after toDF() I had to add select("*"), otherwise the schema was empty:
df.select("*")
    .withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
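Putting both adjustments together, the transformation chain would look roughly like this (just a sketch, reusing the column names from Yuriy's answer):

repartitioned_with_new_columns_df = (
    df.select("*")  # select("*") works around the empty-schema issue mentioned above
    .withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
    .withColumn("year", year(col("date_col")))
    .withColumn("month", month(col("date_col")))
    .withColumn("day", dayofmonth(col("date_col")))
    .drop("date_col")
    .repartition(1)
)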
To achieve this in AWS Glue Studio:
You will need to write a custom transform function to convert the datetime field to a date. There is the extra step of converting the result back to a DynamicFrameCollection.
In Python:
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # take the single incoming frame out of the collection and convert it to a Spark DataFrame
    df = dfc.select(list(dfc.keys())[0]).toDF()
    # cast the datetime field to a date type so it can be used as a partition
    df_with_date = df.withColumn('date_field', df['datetime_field'].cast('date'))
    glue_df = DynamicFrame.fromDF(df_with_date, glueContext, "transform_date")
    return DynamicFrameCollection({"CustomTransform0": glue_df}, glueContext)
You would then have to edit the custom transformer schema to include that new date field you just created.
You can then use the "data target" node to write the data to disk and then select that new date field to use as a partition.
video step by step walkthrough
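If you add further nodes after the custom transform, Glue Studio normally follows it with a SelectFromCollection transform to pull the single frame back out of the collection. A minimal sketch of that generated step (the node name Transform0 is just a placeholder):

# Transform0 is assumed to be the DynamicFrameCollection returned by MyTransform above
SelectedFrame = SelectFromCollection.apply(
    dfc=Transform0,
    key=list(Transform0.keys())[0],
    transformation_ctx="SelectedFrame")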
Related
Does anyone know whether it's possible to tell the Glue writer to keep the column you're partitioning on in the actual dataframe?
https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/
Here, $outpath is a placeholder for the base output path in S3. The
partitionKeys parameter can also be specified in Python in the
connection_options dict:
glue_context.write_dynamic_frame.from_options(
    frame = projectedEvents,
    connection_type = "s3",
    connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
    format = "parquet")
When you execute this write, the type field is removed from the
individual records and is encoded in the directory structure.
I would like to keep the type field in the individual record.
I am not 100% sure if it is possible to tell Glue to keep the column, but in the meantime you could use this workaround:

projectedEvents = projectedEvents.withColumn("type_partition", projectedEvents["type"])

glue_context.write_dynamic_frame.from_options(
    frame=projectedEvents,
    connection_type="s3",
    connection_options={"path": "$outpath", "partitionKeys": ["type_partition"]},
    format="parquet"
)
I read the Glue catalog table, convert it to a DataFrame, and print the schema using the code below (Spark with Python):

dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name',
                                                    redshift_tmp_dir=args['TempDir'])
df = dyf.toDF()
df.printSchema()
It works fine when the table has data.
But it doesn't print the schema if the table is empty (it is unable to get the schema of an empty table). As a result, the subsequent joins fail.
Is there a way to overcome this and make the dynamic frame get the table schema from the catalog even for an empty table, or are there any other alternatives?
I found a solution. It is not ideal, but it works. If you call apply_mapping() on your DynamicFrame, it will preserve the schema in the DataFrame. For example, if your table has a column last_name, you can do:

dyf = glueContext.create_dynamic_frame.from_catalog(database='database_name',
                                                    table_name='table_name')
df = dyf.apply_mapping([
    ("last_name", "string", "last_name", "string")
]).toDF()
df.printSchema()
I am writing a script for AWS Glue that is sourced from Parquet files stored in S3, in which I am creating a DynamicFrame and attempting to use push-down predicate logic to restrict the data coming in.
The table partitions are (in order): account_id > region > vpc_id > dt
And the code for creating the dynamic_frame is the following:
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database=DATABASE_NAME,
    table_name=TABLE_NAME,
    push_down_predicate="dt='" + DATE + "'")
where DATE = '2019-10-29'
However, it seems that Glue still attempts to read data from other days. Maybe it's because I have to specify a push_down_predicate for the other partition columns as well?
As per the comments, the logs show that the date partition column is named "dt" in the S3 paths, whereas in your table it is being referred to by the name "date".
Logs
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY/dt=2019-07-15
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY//dt=2019-10-03
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY//dt=2019-08-27
s3://route/account_id=XXX/region=eu-west-1/vpc_id=YYY//dt=2019-10-29 ...
Your Code
dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database=DATABASE_NAME,
    table_name=TABLE_NAME,
    push_down_predicate="date='" + DATE + "'")
Change the date partition column name to dt in your table, and do the same in the push_down_predicate parameter in the above code.
I also see extra forward slashes in some of the paths in the above logs. Were these partitions added manually through Athena using the ALTER TABLE command? If so, I would recommend using the MSCK REPAIR TABLE command to load all partitions into the table to avoid such issues. Extra slashes in an S3 path sometimes lead to errors while doing ETL through Spark.
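For reference, once the partition column is consistently named dt, the call would look roughly like this (a sketch; you can also combine several partition columns in a single predicate):

dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database=DATABASE_NAME,
    table_name=TABLE_NAME,
    # assuming account_id, region, vpc_id and dt are the partition columns
    push_down_predicate="dt = '{}' AND region = 'eu-west-1'".format(DATE))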
I have to write a Spark DataFrame into an S3 bucket, and it should create a separate Parquet file for each partition.
Here is my code:
dynamicDataFrame = DynamicFrame.fromDF(
    testDataFrame, glueContext,
    "dynamicDataFrame")

glueContext.write_dynamic_frame.from_options(
    frame=dynamicDataFrame,
    connection_type="s3",
    connection_options={
        "path": "s3://BUCKET_NAME/DIR_NAME",
        "partitionKeys": ["COL_NAME"]
    },
    format="parquet"
)
When I specify the "partitionKeys": ["COL_NAME"] option, the Glue job executes without any error, but it does not create any files in S3.
And when I remove this "partitionKeys" option, it creates 200 Parquet files in S3 (the default number of partitions is 200).
But I want to create partitions on the basis of a particular column.
So, is it possible to create partition wise parquet files in S3 while writing a DF in S3?
Note: I am using AWS resources i.e. AWS Glue.
Are you sure the partition column has data?
Did you find anything in the Glue logs?
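One quick way to check both points (a sketch, assuming COL_NAME is the column you are partitioning by) is to inspect the DataFrame before converting it to a DynamicFrame:

# confirm the frame actually contains rows
print(testDataFrame.count())

# look at the distribution of the partition column; all-null values would explain missing output
testDataFrame.groupBy("COL_NAME").count().show(truncate=False)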
In AWS Glue's catalog, I have an external table defined with partitions that looks roughly like this in S3, and partitions for new dates are added daily:
s3://my-data-lake/test-table/
2017/01/01/
part-0000-blah.csv.gz
.
.
part-8000-blah.csv.gz
2017/01/02/
part-0000-blah.csv.gz
.
.
part-7666-blah.csv.gz
How could I use Glue/Spark to convert this to Parquet that is also partitioned by date and split across n files per day? The examples don't cover partitioning, splitting, or provisioning (how many nodes and how big). Each day contains a couple hundred GBs.
Because the source CSVs are not necessarily in the right partitions (wrong date) and are inconsistent in size, I'm hoping to write to partitioned Parquet with the right partition and more consistent size.
Since the source CSV files are not necessarily in the right date partition, you could add additional information to them regarding the collection date and time (or use any date field if already available):
{"collectDateTime": {
"timestamp": 1518091828,
"timestampMs": 1518091828116,
"day": 8,
"month": 2,
"year": 2018
}}
Then your job could use this information in the output DynamicFrame and ultimately use them as partitions. Some sample code of how to achieve this:
from awsglue.transforms import *
from pyspark.sql.types import *
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
import sys
import datetime
###
# CREATE THE NEW SIMPLIFIED LINE
##
def create_simplified_line(event_dict):
    # collect date time
    collect_date_time_dict = event_dict["collectDateTime"]
    new_line = {
        # TODO: COPY YOUR DATA HERE
        "myData": event_dict["myData"],
        "someOtherData": event_dict["someOtherData"],
        "timestamp": collect_date_time_dict["timestamp"],
        "timestampmilliseconds": int(collect_date_time_dict["timestamp"]) * 1000,  # int() rather than long() for Python 3
        "year": collect_date_time_dict["year"],
        "month": collect_date_time_dict["month"],
        "day": collect_date_time_dict["day"]
    }
    return new_line
###
# MAIN FUNCTION
##
# context
glueContext = GlueContext(SparkContext.getOrCreate())
# fetch from previous day source bucket
previous_date = datetime.datetime.utcnow() - datetime.timedelta(days=1)
# build s3 paths
s3_path = "s3://source-bucket/path/year={}/month={}/day={}/".format(previous_date.year, previous_date.month, previous_date.day)
# create dynamic_frame
dynamic_frame = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options={"paths": [s3_path]}, format="json", format_options={}, transformation_ctx="dynamic_frame")
# resolve choices (optional)
dynamic_frame_resolved = ResolveChoice.apply(frame=dynamic_frame,choice="project:double",transformation_ctx="dynamic_frame_resolved")
# transform the source dynamic frame into a simplified version
result_frame = Map.apply(frame=dynamic_frame_resolved, f=create_simplified_line)
# write to simple storage service in parquet format
glueContext.write_dynamic_frame.from_options(frame=result_frame, connection_type="s3", connection_options={"path":"s3://target-bucket/path/","partitionKeys":["year", "month", "day"]}, format="parquet")
Did not test it, but the script is just a sample of how to achieve this and is fairly straightforward.
UPDATE
1) As for having specific file sizes/numbers in output partitions,
Spark's coalesce and repartition features are not yet implemented in Glue's Python API (only in Scala).
You can convert your dynamic frame into a data frame and leverage Spark's partition capabilities.
Convert to a DataFrame and repartition; repartition(1) collapses everything into a single partition, while repartition("partition_col") would instead partition by a column:

partitioned_dataframe = datasource0.toDF().repartition(1)

Convert back to a DynamicFrame for further processing.

partitioned_dynamicframe = DynamicFrame.fromDF(partitioned_dataframe,
                                               glueContext, "partitioned_df")
The good news is that Glue has an interesting feature: if you have more than 50,000 input files per partition, it'll automatically group them for you.
In case you want to set this behavior explicitly regardless of the number of input files (your case), you may set the following connection_options while "creating a dynamic frame from options":
dynamic_frame = glueContext.create_dynamic_frame.from_options(connection_type="s3", connection_options={"paths": [s3_path], 'groupFiles': 'inPartition', 'groupSize': 1024 * 1024}, format="json", format_options={}, transformation_ctx="dynamic_frame")
In the previous example, it would attempt to group files into 1MB groups.
It is worth mentioning that this is not the same as coalesce, but it may help if your goal is to reduce the number of files per partition.
2) If files already exist in the destination, will it just safely add it (not overwrite or delete)
Glue's default SaveMode for write_dynamic_frame.from_options is to append.
When saving a DataFrame to a data source, if data/table already
exists, contents of the DataFrame are expected to be appended to
existing data.
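If you ever need overwrite semantics instead of append, one common approach (an assumption on my side, as the original answer only covers append) is to drop down to the Spark DataFrame writer, which lets you set the save mode explicitly:

# result_frame is the DynamicFrame produced earlier in the sample script
result_df = result_frame.toDF()
(result_df.write
    .mode("overwrite")                    # replace existing data instead of appending
    .partitionBy("year", "month", "day")  # same partition layout as the partitionKeys option
    .parquet("s3://target-bucket/path/"))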
3) Given each source partition may be 30-100GB, what's a guideline for # of DPUs
I'm afraid I won't be able to answer that. It depends on how fast it'll load your input files (size/number), your script's transformations, etc.
Import the datetime library:

import datetime

Split the timestamp based on the partition conditions:

now = datetime.datetime.now()
year = str(now.year)
month = str(now.month)
day = str(now.day)
currdate = "s3://Destination/" + year + "/" + month + "/" + day

Add the variable currdate to the path in the writer class. The results will be partitioned Parquet files.
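For example, plugging currdate into the sink (a sketch, assuming dyf is the DynamicFrame you want to persist):

glueContext.write_dynamic_frame.from_options(
    frame=dyf,  # dyf is assumed to be the DynamicFrame built earlier in the job
    connection_type="s3",
    connection_options={"path": currdate},
    format="parquet")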