In AWS Athena we can use "$path" in a SELECT query to get the S3 path where that data is stored.
ex: my_s3_bucket/dev/dev-data=[date]/[something random alphanumeric]
The date format is YYYY-MM-DD.
How can I get the same path when using GlueContext's create_dynamic_frame_from_options() method,
where connection_type is "s3" and connection_options contains the S3 paths?
ex: my_s3_bucket/data/dev-data=2022-10-16/
I would like to see the complete path from the DynamicFrame, which would be my_s3_bucket/dev/dev-data=2022-10-16/[something random alphanumeric].
How can I do that?
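A minimal sketch of one way to surface the source object path: convert the DynamicFrame to a Spark DataFrame and attach Spark's input_file_name(), which returns the full path of the file each row was read from. The bucket, path, and format below are placeholders; adjust them to your data.
# Sketch only: placeholders for bucket, path, and format
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql.functions import input_file_name

glueContext = GlueContext(SparkContext.getOrCreate())

dyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my_s3_bucket/dev/dev-data=2022-10-16/"]},
    format="json"  # assumption: replace with the actual file format
)

# Convert to a DataFrame and record the full S3 object path, similar to "$path" in Athena
df = dyf.toDF().withColumn("source_path", input_file_name())
df.select("source_path").distinct().show(truncate=False)
If input_file_name() comes back empty for a DynamicFrame-based read, the same column can be added after reading the files with spark.read instead.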
I have a large table that I need to unload to S3, partitioning it by year and month, such that the following folder structure is achieved.
bucket_name/folder_name/year/month/
As of now I'm doing this:
unload ('select *, extract(year from question_asked_at) as year, extract(month from question_asked_at) as month from schema_name.table_name')
to 's3://bucket_name/folder_name/'
iam_role '<iam_role>'
partition by (year, month);
The results are:
bucket_name/folder_name/year=2001/month=01/,
bucket_name/folder_name/year=2001/month=02/
The partitioning works, but I need to remove the year= and month= prefixes. Any suggestions?
The partition_column=value format is a convention coined by Hive, and Redshift UNLOAD follows that convention (see the Redshift manual for UNLOAD).
I think that to achieve your goal, you'd need to move the files to new prefixes (without year= and month=) as a separate process, using e.g. bash or Python and some regex magic.
I tried to sketch how to do that with boto3, and this is what I came up with:
import boto3
import re

s3 = boto3.resource("s3")

bucket_name = "sbochniak-zalon-eu-central-1"
prefix = "firehose_zalon_backend_events/"

# List all object keys under the prefix
keys = [
    o.key
    for o in s3.Bucket(bucket_name).objects.filter(Prefix=prefix).all()
]

# Strip the "year=" and "month=" parts, keeping the values
new_keys = [
    re.sub(r'^(.*)year=(\w+)(.*)month=(\w+)(.*)$', r'\1\2\3\4\5', k)
    for k in keys
]

# Copy each object to its new key, then delete the old one
for old_key, new_key in zip(keys, new_keys):
    s3.Object(bucket_name, new_key).copy_from(CopySource={"Bucket": bucket_name, "Key": old_key})
    s3.Object(bucket_name, old_key).delete()
I have close to 90 GB of data that needs to be uploaded to an S3 bucket with a specific naming convention.
If I use a CTAS query with external_location, it does not give me the option to give the file a specific name. Additionally, CSV is not an option for the format property.
CREATE TABLE ctas_csv_partitioned
WITH (
    format = 'TEXTFILE',
    external_location = 's3://my_athena_results/ctas_csv_partitioned/',
    partitioned_by = ARRAY['key1']
)
AS SELECT name1, address1, comment1, key1
FROM tables1
I want the output file to look like sample_file.csv.gz.
What is the easiest way to go about this?
Unfortunately, there is no way to specify either the file name or its extension with Athena alone. Moreover, files created with a CTAS query won't have any file extension at all. However, you can rename the files directly with the AWS CLI for S3.
aws s3 ls s3://path/to/external/location/ --recursive \
| awk '{cmd="aws s3 mv s3://path/to/external/location/"$4 " s3://path/to/external/location/"$4".csv.gz"; system(cmd)}'
I just tried this snippet and everything worked fine. However, sometimes an empty file s3://path/to/external/location/.csv.gz would also get created. Note that I didn't include the --recursive option for aws s3 mv, since it would also produce weird results.
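If you prefer boto3 over the CLI, a minimal sketch of the same rename could look like this; the bucket and prefix below are assumptions taken from the CTAS example, so adjust them to your external_location.
import boto3

s3 = boto3.resource("s3")
bucket = "my_athena_results"      # assumption: the bucket from external_location
prefix = "ctas_csv_partitioned/"  # assumption: the prefix from external_location

for obj in s3.Bucket(bucket).objects.filter(Prefix=prefix):
    # Skip folder markers and anything already renamed
    if obj.key.endswith("/") or obj.key.endswith(".csv.gz"):
        continue
    new_key = obj.key + ".csv.gz"
    s3.Object(bucket, new_key).copy_from(CopySource={"Bucket": bucket, "Key": obj.key})
    s3.Object(bucket, obj.key).delete()
Since it iterates over the actual object keys, it shouldn't produce the empty .csv.gz object mentioned above.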
As far as the format is concerned, you simply need to add field_delimiter = ',' to the WITH clause.
CREATE TABLE ctas_csv_partitioned
WITH (
    format = 'TEXTFILE',
    field_delimiter = ',',
    external_location = 's3://my_athena_results/ctas_csv_partitioned/',
    partitioned_by = ARRAY['key1']
)
AS SELECT name1, address1, comment1, key1
FROM tables1
The current set-up:
An S3 location with JSON files. All files are stored in the same location (no day/month/year structure).
A Glue Crawler reads the data into a catalog table.
A Glue ETL job transforms the data and stores it as Parquet tables in S3.
A Glue Crawler reads from the S3 Parquet tables and stores them in a new table that gets queried by Athena.
What I want to achieve is (1) the Parquet tables being partitioned by day and (2) the Parquet data for one day being in the same file. Currently there is a Parquet file for each JSON file.
How would I go about it?
One thing to mention: there is a datetime column in the data, but it's a Unix epoch timestamp. I would probably need to convert that to a 'year/month/day' format, otherwise I'm assuming it will create a partition for each file again.
Thanks a lot for your help!!
Convert Glue's DynamicFrame into Spark's DataFrame to add year/month/day columns and repartition. Reducing partitions to one will ensure that only one file will be written into a folder, but it may slow down job performance.
Here is the Python code:
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, year, month, dayofmonth, to_date, from_unixtime
...
df = dynamicFrameSrc.toDF()

repartitioned_with_new_columns_df = (
    df
    # Convert the Unix epoch timestamp into a date, then split it into year/month/day columns
    .withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
    .withColumn("year", year(col("date_col")))
    .withColumn("month", month(col("date_col")))
    .withColumn("day", dayofmonth(col("date_col")))
    .drop(col("date_col"))
    # One partition means one output file per folder
    .repartition(1)
)

dyf = DynamicFrame.fromDF(repartitioned_with_new_columns_df, glueContext, "enriched")

datasink = glueContext.write_dynamic_frame.from_options(
    frame = dyf,
    connection_type = "s3",
    connection_options = {
        "path": "s3://yourbucket/data",
        "partitionKeys": ["year", "month", "day"]
    },
    format = "parquet",
    transformation_ctx = "datasink"
)
Note that the from pyspark.sql.functions import col can give a reference error; this shouldn't be a problem, as explained here.
I cannot comment, so I am going to write this as an answer.
I used Yuriy's code, and a couple of things needed adjustment:
missing brackets
df = dynamicFrameSrc.toDF()
After toDF() I had to add select("*"), otherwise the schema was empty:
df.select("*")
    .withColumn("date_col", to_date(from_unixtime(col("unix_time_col"))))
To achieve this in AWS Glue Studio:
You will need to make a custom function to convert the datetime field to a date. There is also the extra step of converting it back to a DynamicFrameCollection.
In Python:
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # Take the first (and only) DynamicFrame in the collection and convert it to a DataFrame
    df = dfc.select(list(dfc.keys())[0]).toDF()
    # Cast the datetime field to a date type so it can be used as a partition
    df_with_date = df.withColumn('date_field', df['datetime_field'].cast('date'))
    glue_df = DynamicFrame.fromDF(df_with_date, glueContext, "transform_date")
    return DynamicFrameCollection({"CustomTransform0": glue_df}, glueContext)
You would then have to edit the custom transformer schema to include that new date field you just created.
You can then use the "data target" node to write the data to disk and then select that new date field to use as a partition.
video step by step walkthrough
I have to write a Spark DataFrame to an S3 bucket, and it should create a separate Parquet file for each partition.
Here is my code:
dynamicDataFrame = DynamicFrame.fromDF(
    testDataFrame, glueContext,
    "dynamicDataFrame")

glueContext.write_dynamic_frame.from_options(
    frame = dynamicDataFrame,
    connection_type = "s3",
    connection_options = {
        "path": "s3://BUCKET_NAME/DIR_NAME",
        "partitionKeys": ["COL_NAME"]
    },
    format = "parquet"
)
When I specify "partitionKeys": ["COL_NAME"] option then Glue Job gets executed without any error but it does not create any file in S3.
And when I remove this "partitionKeys" option then it creates 200 parquet files in S3(default No Of Partition is 200).
But I want to create partitions on the basis of a particular column.
So, is it possible to create partition wise parquet files in S3 while writing a DF in S3?
Note: I am using AWS resources i.e. AWS Glue.
Are you sure the partition column has data?
Do you find anything in the Glue logs?
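A quick sanity check could look like this (just a sketch, reusing the testDataFrame name from the question):
from pyspark.sql.functions import col

# Count rows per partition value; if everything lands in a null group,
# the partitioned write may produce no usable output
testDataFrame.groupBy("COL_NAME").count().show(truncate=False)
print("rows with null partition value:", testDataFrame.filter(col("COL_NAME").isNull()).count())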
I am able to write in Parquet format, partitioned by a column, like so:
jobname = args['JOB_NAME']
# header is a Spark DataFrame
header.repartition(1).write.parquet('s3://bucket/aws-glue/{}/header/'.format(jobname), 'append', partitionBy='date')
But I am not able to do this with Glue's DynamicFrame.
header_tmp = DynamicFrame.fromDF(header, glueContext, "header")
glueContext.write_dynamic_frame.from_options(frame = header_tmp, connection_type = "s3", connection_options = {"path": 's3://bucket/output/header/'}, format = "parquet")
I have tried passing partitionBy as part of the connection_options dict, since the AWS docs say that for Parquet, Glue does not support any format options, but that didn't work.
Is this possible, and how? As for my reasons for doing it this way: I thought it was needed for job bookmarking to work, as that is not working for me currently.
From AWS Support (paraphrasing a bit):
As of today, Glue does not support the partitionBy parameter when writing to Parquet. This is in the pipeline to be worked on, though.
Using the Glue API to write to Parquet is required for the job bookmarking feature to work with S3 sources.
So as of today it is not possible to partition parquet files AND enable the job bookmarking feature.
Edit: today (3/23/18) I found this in the documentation:
glue_context.write_dynamic_frame.from_options(
    frame = projectedEvents,
    connection_options = {"path": "$outpath", "partitionKeys": ["type"]},
    format = "parquet")
That option may have always been there and both myself and the AWS support person missed it, or it was only added recently. Either way, it seems like it is possible now.
I use some of the columns from my DataFrame as the partitionKeys object:
glueContext.write_dynamic_frame \
    .from_options(
        frame = some_dynamic_dataframe,
        connection_type = "s3",
        connection_options = {"path": "some_path", "partitionKeys": ["month", "day"]},
        format = "parquet")