I have to write a Spark DataFrame into S3 bucket and it should create a separate parquet file for each partition.
Here is my code:
dynamicDataFrame = DynamicFrame.fromDF(
    testDataFrame, glueContext, "dynamicDataFrame")

glueContext.write_dynamic_frame.from_options(
    frame = dynamicDataFrame,
    connection_type = "s3",
    connection_options = {
        "path": "s3://BUCKET_NAME/DIR_NAME",
        "partitionKeys": ["COL_NAME"]
    },
    format = "parquet"
)
When I specify the "partitionKeys": ["COL_NAME"] option, the Glue job runs without any error but does not create any files in S3.
When I remove the "partitionKeys" option, it creates 200 parquet files in S3 (the default number of partitions is 200).
But I want to create the partitions on the basis of a particular column.
So, is it possible to create partition-wise parquet files in S3 while writing a DataFrame to S3?
Note: I am using AWS resources i.e. AWS Glue.
Are you sure the partition column has data?
Do you find anything in the Glue logs?
In AWS Athena we can use "$path" in a SELECT query to get the S3 path where each row's data is stored.
ex: my_s3_bucket/dev/dev-data=[date]/[something random alphanumeric]
The date format is YYYY-MM-DD.
How can I get the same path when using GlueContext's create_dynamic_frame_from_options() method, where connection_type is "s3" and connection_options contains the S3 paths?
ex: my_s3_bucket/data/dev-data=2022-10-16/
I would like to see the complete path via the DynamicFrame, which would be my_s3_bucket/dev/dev-data=2022-10-16/[something random alphanumeric]
How can I do that?
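One possible approach (a sketch only, not verified against your exact setup): Glue does not expose "$path" directly, but after converting the DynamicFrame to a Spark DataFrame you can attach each row's source object path with Spark's input_file_name() function. The bucket, prefix, format, and column name below are illustrative placeholders.
from pyspark.sql.functions import input_file_name
from awsglue.dynamicframe import DynamicFrame

# Read from S3 with the options-based reader (paths/format are placeholders).
dyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my_s3_bucket/dev/"], "recurse": True},
    format="json",
)

# Attach the full S3 object path of the file each row came from.
df_with_path = dyf.toDF().withColumn("source_path", input_file_name())
df_with_path.select("source_path").show(truncate=False)

# Convert back to a DynamicFrame if the rest of the job uses Glue transforms.
dyf_with_path = DynamicFrame.fromDF(df_with_path, glueContext, "with_path")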
I have used a "for in" loop script in AWS Glue to move 70 tables from S3 to Redshift. But when I run the script again and again, the data is being duplicated. I have seen one document as a solution for this:
https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/
But in my case, as I am using a "for in" loop to move the tables together, how can I make use of the staging-table concept described in the document?
Here is the script I am using for moving tables to redshift:
client = boto3.client("glue", region_name="us-east-1")

databaseName = "db1_g"
Tables = client.get_tables(DatabaseName=databaseName)
tableList = Tables["TableList"]

for table in tableList:
    tableName = table["Name"]
    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database="db1_g", table_name=tableName, transformation_ctx="datasource0"
    )
    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=datasource0,
        catalog_connection="redshift",
        connection_options={
            "dbtable": f"schema1.{tableName}",
            "database": "db1",
        },
        redshift_tmp_dir=args["TempDir"],
        transformation_ctx="datasink4",
    )

job.commit()
Is there any way to avoid duplicating data when using a loop script to move the tables?
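Not a definitive answer, but one pattern worth trying, following the staging-table idea from the linked AWS article: Glue's Redshift writer accepts "preactions" and "postactions" SQL in connection_options, so each iteration of the loop can load into a staging table and then merge into the target. The sketch below assumes your tables have an "id" merge key; the schema and staging names are placeholders.
for table in tableList:
    tableName = table["Name"]
    staging = f"schema1.{tableName}_staging"   # placeholder naming convention
    target = f"schema1.{tableName}"

    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database="db1_g", table_name=tableName, transformation_ctx=f"src_{tableName}"
    )

    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=datasource0,
        catalog_connection="redshift",
        connection_options={
            "dbtable": staging,
            "database": "db1",
            # Runs before the load: recreate an empty staging table.
            "preactions": f"DROP TABLE IF EXISTS {staging}; CREATE TABLE {staging} (LIKE {target});",
            # Runs after the load: replace matching rows in the target, then drop staging.
            "postactions": (
                f"BEGIN; "
                f"DELETE FROM {target} USING {staging} WHERE {target}.id = {staging}.id; "
                f"INSERT INTO {target} SELECT * FROM {staging}; "
                f"DROP TABLE {staging}; "
                f"END;"
            ),
        },
        redshift_tmp_dir=args["TempDir"],
        transformation_ctx=f"sink_{tableName}",
    )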
The current set-up:
S3 location with json files. All files stored in the same location (no day/month/year structure).
Glue Crawler reads the data in a catalog table
Glue ETL job transforms and stores the data into parquet tables in s3
Glue Crawler reads from s3 parquet tables and stores into a new table that gets queried by Athena
What I want to achieve is (1) for the parquet output to be partitioned by day and (2) for each day's data to end up in a single file. Currently there is a separate parquet file for each json file.
How would I go about it?
One thing to mention, there is a datetime column in the data, but it's a unix epoch timestamp. I would probably need to convert that to a 'year/month/day' format, otherwise I'm assuming it will create a partition for each file again.
Thanks a lot for your help!!
Convert Glue's DynamicFrame into Spark's DataFrame to add year/month/day columns and repartition. Reducing partitions to one will ensure that only one file will be written into a folder, but it may slow down job performance.
Here is the Python code:
from pyspark.sql.functions import col, year, month, dayofmonth, to_date, from_unixtime
...
df = dynamicFrameSrc.toDF()

repartitioned_with_new_columns_df = (
    df
    .withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
    .withColumn("year", year(col("date_col")))
    .withColumn("month", month(col("date_col")))
    .withColumn("day", dayofmonth(col("date_col")))
    .drop(col("date_col"))
    .repartition(1)
)

dyf = DynamicFrame.fromDF(repartitioned_with_new_columns_df, glueContext, "enriched")

datasink = glueContext.write_dynamic_frame.from_options(
    frame = dyf,
    connection_type = "s3",
    connection_options = {
        "path": "s3://yourbucket/data",
        "partitionKeys": ["year", "month", "day"]
    },
    format = "parquet",
    transformation_ctx = "datasink"
)
Note that the from pyspark.sql.functions import col can give a reference error; this shouldn't be a problem, as explained here.
I cannot comment so I am going to write as an answer.
I used Yuriy's code and a couple of things needed adjustment:
missing brackets:
df = dynamicFrameSrc.toDF()
after toDF() I had to add select("*"), otherwise the schema was empty:
df.select("*") \
    .withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
To achieve this in AWS Glue Studio:
You will need to make a custom function to convert the datetime field to date. There is the extra step of converting it back to a DynamicFrameCollection.
In Python:
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0]).toDF()
    df_with_date = df.withColumn('date_field', df['datetime_field'].cast('date'))
    glue_df = DynamicFrame.fromDF(df_with_date, glueContext, "transform_date")
    return DynamicFrameCollection({"CustomTransform0": glue_df}, glueContext)
You would then have to edit the custom transformer schema to include that new date field you just created.
You can then use the "data target" node to write the data to disk and then select that new date field to use as a partition.
Video: step-by-step walkthrough
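For reference, a rough sketch of the kind of script Glue Studio generates around such a transform (the node name transform_node, the output path, and the date_field column are placeholders): pick the single frame back out of the returned collection, then write it to S3 partitioned by the new date field.
from awsglue.transforms import SelectFromCollection

# transform_node is the DynamicFrameCollection returned by MyTransform above.
picked = SelectFromCollection.apply(
    dfc=transform_node,
    key=list(transform_node.keys())[0],
    transformation_ctx="picked",
)

glueContext.write_dynamic_frame.from_options(
    frame=picked,
    connection_type="s3",
    connection_options={
        "path": "s3://yourbucket/output/",
        "partitionKeys": ["date_field"],
    },
    format="parquet",
    transformation_ctx="datasink",
)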
Does anyone have a quick way of getting the line count of a file hosted in S3? Preferably using the CLI, s3api but I am open to python/boto as well.
Note: solution must run non-interactively, ie in an overnight batch.
Right now I am doing this; it works but takes around 10 minutes for a 20 GB file:
aws s3 cp s3://foo/bar - | wc -l
Here are two methods that might work for you...
Amazon S3 has a new feature called S3 Select that allows you to query files stored on S3.
You can perform a count of the number of records (lines) in a file and it can even work on GZIP files. Results may vary depending upon your file format.
Amazon Athena is also a similar option that might be suitable. It can query files stored in Amazon S3.
Yes, Amazon S3 has the S3 Select feature; just keep an eye on the cost while executing any query from the Select tab.
For example, here is the pricing as of June 2018 (this may vary):
S3 Select pricing is based on the size of the input, the output, and the data transferred.
Each query will cost 0.002 USD per GB scanned, plus 0.0007 USD per GB returned.
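As a rough worked example (using those 2018 rates and the 20 GB file from the question, so only an estimate): the scan would cost about 20 GB x 0.002 USD/GB = 0.04 USD, and since a COUNT(*) query returns only a few bytes, the per-GB-returned charge is negligible.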
You can do it using python/boto3.
Define the bucket name and object key:
import os
import boto3

colsep = ','

s3 = boto3.client('s3')
bucket_name = 'my-data-test'
s3_key = 'in/file.parquet'
Note that S3 SELECT can access only one file at a time.
Now you can open S3 SELECT cursor:
sql_stmt = """SELECT count(*) FROM s3object S"""
req_fact = s3.select_object_content(
    Bucket = bucket_name,
    Key = s3_key,
    ExpressionType = 'SQL',
    Expression = sql_stmt,
    InputSerialization = {'Parquet': {}},
    OutputSerialization = {'CSV': {
        'RecordDelimiter': os.linesep,
        'FieldDelimiter': colsep}},
)
Now iterate through the returned records:
for event in req_fact['Payload']:
    if 'Records' in event:
        rr = event['Records']['Payload'].decode('utf-8')
        for i, rec in enumerate(rr.split(os.linesep)):
            if rec:
                row = rec.split(colsep)
                if row:
                    print('File line count:', row[0])
If you want to count records in all parquet files in a given S3 directory, check out this python/boto3 script: S3-parquet-files-row-counter
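If you would rather not pull in that script, here is a rough sketch of the same idea (the bucket, the prefix, and the assumption that every .parquet object under the prefix should be counted are placeholders): list the keys under the prefix and sum the per-file counts returned by S3 Select.
import boto3

s3 = boto3.client('s3')
bucket_name = 'my-data-test'   # placeholder
prefix = 'in/'                 # placeholder

total = 0
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
    for obj in page.get('Contents', []):
        if not obj['Key'].endswith('.parquet'):
            continue
        resp = s3.select_object_content(
            Bucket=bucket_name,
            Key=obj['Key'],
            ExpressionType='SQL',
            Expression="SELECT count(*) FROM s3object",
            InputSerialization={'Parquet': {}},
            OutputSerialization={'CSV': {}},
        )
        # Each response streams events; the Records event carries the count.
        for event in resp['Payload']:
            if 'Records' in event:
                total += int(event['Records']['Payload'].decode('utf-8').strip())

print('Total row count:', total)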
I am able to write to parquet format and partitioned by a column like so:
jobname = args['JOB_NAME']
#header is a spark DataFrame
header.repartition(1).write.parquet('s3://bucket/aws-glue/{}/header/'.format(jobname), 'append', partitionBy='date')
But I am not able to do this with Glue's DynamicFrame.
header_tmp = DynamicFrame.fromDF(header, glueContext, "header")
glueContext.write_dynamic_frame.from_options(
    frame = header_tmp,
    connection_type = "s3",
    connection_options = {"path": 's3://bucket/output/header/'},
    format = "parquet"
)
I have tried passing the partitionBy as a part of connection_options dict, since AWS docs say for parquet Glue does not support any format options, but that didn't work.
Is this possible, and how? As for reasons for doing it this way, I thought it was needed for job bookmarking to work, as that is not working for me currently.
From AWS Support (paraphrasing a bit):
As of today, Glue does not support the partitionBy parameter when writing to parquet. This is in the pipeline to be worked on though.
Using the Glue API to write to parquet is required for job bookmarking feature to work with S3 sources.
So as of today it is not possible to partition parquet files AND enable the job bookmarking feature.
Edit: today (3/23/18) I found this in the documentation:
glue_context.write_dynamic_frame.from_options(
    frame = projectedEvents,
    connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
    format = "parquet")
That option may have always been there and both myself and the AWS support person missed it, or it was only added recently. Either way, it seems like it is possible now.
I use some of the columns from my DataFrame as the partitionKeys object:
glueContext.write_dynamic_frame \
    .from_options(
        frame = some_dynamic_dataframe,
        connection_type = "s3",
        connection_options = {"path": "some_path", "partitionKeys": ["month", "day"]},
        format = "parquet")