AWS Glue job create_dynamic_frame_from_options() opening a specific file?

If one uses create_dynamic_frame_from_catalog(), you supply the database name and table name, e.g. created from a Glue crawler, which effectively names a specific input file. I want to be able to do the same (name a specific input file) without the crawler and database.
I've tried using create_dynamic_frame_from_options(), but the "path" connection option doesn't allow me to name the file, apparently. Is there any way to do this?

If I understand correctly, you want to read multiple files from a specific S3 path and have the file name in your DataFrame. You can do this by using the Spark session and reading the path as a PySpark DataFrame:
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.functions import input_file_name
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
path = 's3://bucket/folder'
df = spark.read.csv(path)
df = df.withColumn('FileName', input_file_name())
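On the original question of naming one file: a minimal sketch, assuming the S3 connection's "paths" option (a list) accepts a full object URI as an entry and not only a folder prefix. The helper name single_file_options and the bucket/key values are hypothetical, and the Glue call is left as a comment because it only runs inside a Glue job.

```python
def single_file_options(bucket: str, key: str) -> dict:
    """Build S3 connection options that point at exactly one object."""
    # 'paths' is a list, so a one-element list names a single file.
    return {"paths": [f"s3://{bucket}/{key.lstrip('/')}"]}


# Hypothetical bucket and key, for illustration only.
options = single_file_options("bucket", "folder/data.csv")
print(options)

# Assumed usage inside a Glue job (not executed here):
# dyf = glueContext.create_dynamic_frame_from_options(
#     connection_type="s3",
#     connection_options=options,
#     format="csv",
#     format_options={"withHeader": True},
# )
```

If the frame still picks up sibling files, the path is probably being treated as a prefix, so it is worth checking that the key is complete, including the extension.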

Related

loop through multiple tables from source to s3 using glue (Python/Pyspark) through configuration file?

I am looking to ingest multiple tables from a relational database into S3 using Glue. The table details are in a configuration file, which is a JSON file. It would be helpful to have code that can loop through multiple table names and ingest these tables into S3. The Glue script is written in Python (PySpark).
This is a sample of how the configuration file looks:
{"main_key": {
    "source_type": "rdbms",
    "source_schema": "DATABASE",
    "source_table": "DATABASE.Table_1"
}}
This assumes your Glue job can connect to the database and that a Glue Connection has been added to it. Here's a sample extracted from my script that does something similar. You would need to adapt the JDBC URL format to your database (this one uses SQL Server), as well as the implementation details for fetching the config file, looping through the items, etc.
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from datetime import datetime
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
jdbc_url = f"jdbc:sqlserver://{hostname}:{port};databaseName={db_name}"
connection_details = {
"user": 'db_user',
"password": 'db_password',
"driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}
tables_config = get_tables_config_from_s3_as_dict()
date_partition = datetime.today().strftime('%Y%m%d')
write_date_partition = f'year={date_partition[0:4]}/month={date_partition[4:6]}/day={date_partition[6:8]}'
for key, value in tables_config.items():
    table = value['source_table']
    df = spark.read.jdbc(url=jdbc_url, table=table, properties=connection_details)
    write_path = f's3a://bucket-name/{table}/{write_date_partition}'
    df.write.parquet(write_path)
Just write a normal for loop to loop through your DB configuration then follow Spark JDBC documentation to connect to each of them in sequence.
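A minimal sketch of the looping part on its own, without Spark or a database, so it can be checked locally. It assumes the configuration has the shape shown in the question (one entry per key, each with a "source_table"); the names partition_suffix and plan_writes are hypothetical, and the bucket name is a placeholder.

```python
import json
from datetime import date


def partition_suffix(day: date) -> str:
    """Render a date as a year=/month=/day= partition path."""
    return f"year={day:%Y}/month={day:%m}/day={day:%d}"


def plan_writes(config_text: str, bucket: str, day: date) -> list:
    """Return (table, write_path) pairs for every entry in the JSON config."""
    config = json.loads(config_text)
    plan = []
    for entry in config.values():
        table = entry["source_table"]
        plan.append((table, f"s3a://{bucket}/{table}/{partition_suffix(day)}"))
    return plan


# Sample config in the shape from the question.
sample = """
{"main_key": {
    "source_type": "rdbms",
    "source_schema": "DATABASE",
    "source_table": "DATABASE.Table_1"
}}
"""

for table, write_path in plan_writes(sample, "bucket-name", date(2024, 1, 5)):
    # In the Glue job, this is where spark.read.jdbc(...) and
    # df.write.parquet(write_path) would go.
    print(table, write_path)
```

Keeping the planning step separate from the Spark calls makes it easy to verify the table list and target paths before running the job.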

AWS Glue job to unzip a file from S3 and write it back to S3

I'm very new to AWS Glue, and I want to use AWS Glue to unzip a huge file present in a S3 bucket, and write the contents back to S3.
I couldn't find anything while trying to google this requirement.
My questions are:
How to add a zip file as data source to AWS Glue?
How to write it back to same S3 location?
I am using AWS Glue Studio. Any help will be highly appreciated.
In case you are still looking for a solution: you can unzip a file and write it back with an AWS Glue job by using boto3 and Python's zipfile library.
One thing to consider is the size of the zip that you want to process. I've used the following script with a 6GB (zipped), 30GB (unzipped) file and it works fine, but it might fail if the file is too heavy for the worker to buffer.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
import boto3
import io
from zipfile import ZipFile
s3 = boto3.client("s3")
bucket = "wayfair-datasource" # your s3 bucket name
prefix = "files/location/" # the prefix for the objects that you want to unzip
unzip_prefix = "files/unzipped_location/" # the location where you want to store your unzipped files
# Get a list of all the resources in the specified prefix
# (.get avoids a KeyError when nothing matches; list_objects returns at most 1000 keys per call)
objects = s3.list_objects(
    Bucket=bucket,
    Prefix=prefix
).get("Contents", [])
# The following will get the unzipped files so the job doesn't try to unzip a file that is already unzipped on every run
unzipped_objects = s3.list_objects(
    Bucket=bucket,
    Prefix=unzip_prefix
).get("Contents", [])
# Get a list containing the keys of the objects to unzip
object_keys = [ o["Key"] for o in objects if o["Key"].endswith(".zip") ]
# Get the keys for the unzipped objects
unzipped_object_keys = [ o["Key"] for o in unzipped_objects ]
for key in object_keys:
    obj = s3.get_object(
        Bucket=bucket,
        Key=key
    )
    # the whole archive is read into memory here, so it has to fit on the worker
    objbuffer = io.BytesIO(obj["Body"].read())
    # using context manager so you don't have to worry about manually closing the file
    with ZipFile(objbuffer) as archive:
        filenames = archive.namelist()
        # iterate over every file inside the zip
        for filename in filenames:
            with archive.open(filename) as file:
                filepath = unzip_prefix + filename
                if filepath not in unzipped_object_keys:
                    s3.upload_fileobj(file, bucket, filepath)
job.commit()
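The skip-if-already-unzipped check in the script above can be tried locally without S3. This is a minimal sketch using only the standard library; the helper name members_to_upload is hypothetical, and skipping directory entries is an addition that the original script does not make.

```python
import io
from zipfile import ZipFile


def members_to_upload(zip_bytes, unzip_prefix, existing_keys):
    """Return (member_name, target_key) pairs that still need uploading."""
    existing = set(existing_keys)
    pending = []
    with ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            # Directory entries carry no data, so there is nothing to upload.
            if info.is_dir():
                continue
            target_key = unzip_prefix + info.filename
            if target_key not in existing:
                pending.append((info.filename, target_key))
    return pending


# Build a small archive in memory to exercise the helper.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("a.csv", "id\n1\n")
    archive.writestr("nested/", "")
    archive.writestr("nested/b.csv", "id\n2\n")

already_there = ["files/unzipped_location/a.csv"]
print(members_to_upload(buffer.getvalue(), "files/unzipped_location/", already_there))
```

Using a set for the existing keys keeps the membership test cheap when the unzipped prefix holds many objects.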
Regarding "I couldn't find anything while trying to google this requirement":
You couldn't find anything about this because this is not what Glue does. Glue can read gzip (not zip) files natively. If you have zip files, you have to convert them yourself in S3; Glue will not do it.
To convert the files, you can download them, re-pack, and re-upload in gzip format, or any other format that Glue supports.
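A minimal sketch of the re-pack step, assuming each file inside the zip should become its own gzip file. It works on bytes in memory with the standard library only; the helper name repack_zip_as_gzip is hypothetical, and downloading from and uploading to S3 are left out.

```python
import gzip
import io
from zipfile import ZipFile


def repack_zip_as_gzip(zip_bytes):
    """Map each file inside a zip archive to '<name>.gz' -> gzip-compressed bytes."""
    repacked = {}
    with ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            # Skip directory entries; only real files are re-compressed.
            if info.is_dir():
                continue
            repacked[info.filename + ".gz"] = gzip.compress(archive.read(info))
    return repacked


# Small in-memory archive for demonstration.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("data.csv", "id,name\n1,alpha\n")

for name, payload in repack_zip_as_gzip(buffer.getvalue()).items():
    print(name, len(payload), "bytes")
```

Like the unzip script, this holds every member in memory, so very large archives would need a streaming approach instead.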

ETL from AWS DataLake to RDS

I'm relatively new to data lakes and I'm doing some research for a project on AWS.
I have created a data lake and have tables generated by Glue crawlers. I can see the data in S3 and query it using Athena. So far so good.
There is a requirement to transform parts of the data stored in the data lake and load them into RDS for applications to read. What is the best solution for ETL from an S3 data lake to RDS?
Most posts I've come across talk about ETL from RDS to S3 and not the other way around.
By creating a Glue Job using the Spark job type I was able to use my S3 table as a data source and an Aurora/MariaDB as the destination.
Trying the same with a python job type didn't allow me to view any S3 tables during the Glue Job Wizard screens.
Once the data is in a Glue DynamicFrame or a Spark DataFrame, writing it out is pretty straightforward: use the RDBMS as the data sink.
For example, to write to a Redshift DB:
// Write data to staging table in Redshift
glueContext.getJDBCSink(
  catalogConnection = "redshift-glue-connections-test",
  options = JsonOptions(Map(
    "database" -> "conndb",
    "dbtable" -> staging,
    "overwrite" -> "true",
    "preactions" -> "<another SQL queries>",
    "postactions" -> "<some SQL queries>"
  )),
  redshiftTmpDir = tempDir,
  transformationContext = "redshift-output"
).writeDynamicFrame(datasetDf)
As shown above, write the data through the JDBC connection you've created.
You can accomplish that with a Glue Job. Sample code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext, SparkConf
from awsglue.context import GlueContext
from awsglue.job import Job
import time
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
file_paths = ['path']
df = glueContext.create_dynamic_frame_from_options("s3", {'paths': file_paths}, format="csv", format_options={"separator": ",", "quoteChar": '"', "withHeader": True})
df.printSchema()
df.show(10)
options = {
'user': 'usr',
'password': 'pwd',
'url': 'url',
'dbtable': 'tabl'}
glueContext.write_from_options(frame_or_dfc=df, connection_type="mysql", connection_options=options)
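The 'url' placeholder in the options above stands for a JDBC URL. A minimal sketch of building that dictionary for a MySQL-compatible target such as Aurora or MariaDB follows; the helper name mysql_sink_options and all host, database, and credential values are hypothetical, and in a real job the credentials would more sensibly come from a Glue Connection or a secrets store than from literals.

```python
def mysql_sink_options(host, port, database, table, user, password):
    """Build connection options for a MySQL-compatible JDBC sink."""
    return {
        # Standard MySQL JDBC URL layout: jdbc:mysql://host:port/database
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
    }


# Placeholder values for illustration only.
options = mysql_sink_options("db.example.internal", 3306, "appdb", "tabl", "usr", "pwd")
print(options["url"])

# Assumed usage inside a Glue job (not executed here):
# glueContext.write_from_options(
#     frame_or_dfc=df, connection_type="mysql", connection_options=options
# )
```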

AWS Glue job bookmark produces duplicates for csv files

We receive one CSV file every day in an S3 bucket from our vendor at 11am.
I convert this file into Parquet format using Glue at 11:30am.
I've enabled the job bookmark so that already processed files are not processed again.
Nonetheless, I see that some files are being reprocessed, which creates duplicates.
I read the questions and answers "AWS Glue Bookmark produces duplicates for PARQUET" and "AWS Glue Job Bookmarking explanation".
They gave a good understanding of job bookmarking, but still do not address the issue.
The AWS documentation says that bookmarking supports CSV files.
I'm wondering if someone can help me understand what the problem could be and, if possible, a solution as well :)
Edit:
Pasting sample code here as requested by Prabhakar.
staging_database_name = "my-glue-db"
s3_target_path = "s3://mybucket/mydata/"
"""
'date_index': date location in the file name
'date_only': only date column is inserted
'date_format': format of date
'path': sub folder name in master bucket
"""
#fouo classified files
tables_spec = {
'sample_table': {'path': 'sample_table/load_date=','pkey': 'mykey', 'orderkey':'myorderkey'}
}
spark_conf = SparkConf().setAll([
("spark.hadoop.fs.s3.enableServerSideEncryption", "true"),
("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId", kms_key_id)
])
sc = SparkContext(conf=spark_conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
for table_name, spec in tables_spec.items():
    datasource0 = glueContext.create_dynamic_frame.from_catalog(database=staging_database_name,
                                                                table_name=table_name,
                                                                transformation_ctx='datasource0')
    resolvechoice2 = ResolveChoice.apply(frame=datasource0, choice="make_struct", transformation_ctx='resolvechoice2')
    # Create Spark data frame with an ingest_datetime column
    delta_df = resolvechoice2.toDF().withColumn('ingest_datetime', lit(str(ingest_datetime)))
    date_dyf = DynamicFrame.fromDF(delta_df, glueContext, "date_dyf")
    master_folder_path1 = os.path.join(s3_target_path, spec['path']).replace('\\', '/')
    master_folder_path = master_folder_path1 + load_date
    datasink4 = glueContext.write_dynamic_frame.from_options(frame=date_dyf,
                                                             connection_type='s3',
                                                             connection_options={"path": master_folder_path},
                                                             format='parquet', transformation_ctx='datasink4')
job.commit()
I spoke to an AWS Support engineer, who said she was able to reproduce the issue and has raised it with the Glue technical team for resolution.
Nonetheless, I couldn't wait for them to fix the bug and have taken a different approach.
Solution:
Disable the Glue bookmark.
After the Glue job converts the csv file to Parquet, I move the csv file to a different location in the S3 bucket.
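The move step of this workaround can be sketched as a copy followed by a delete, since S3 has no rename operation. This is only an illustration of the idea, not the author's actual code: the helper names archive_key and move_object and the prefixes are hypothetical, and the client is passed in so the logic can be tried without AWS access. With boto3, the client would be boto3.client("s3").

```python
def archive_key(key, source_prefix, archive_prefix):
    """Translate a key under source_prefix to the same name under archive_prefix."""
    if not key.startswith(source_prefix):
        raise ValueError(f"{key!r} is not under {source_prefix!r}")
    return archive_prefix + key[len(source_prefix):]


def move_object(s3_client, bucket, key, new_key):
    """Move an object within a bucket: copy to the new key, then delete the old one."""
    s3_client.copy_object(
        Bucket=bucket,
        CopySource={"Bucket": bucket, "Key": key},
        Key=new_key,
    )
    # Delete only after the copy call has returned without raising.
    s3_client.delete_object(Bucket=bucket, Key=key)


# Hypothetical prefixes, for illustration only.
print(archive_key("incoming/vendor.csv", "incoming/", "processed/"))

# Assumed usage at the end of the Glue job (not executed here):
# import boto3
# s3 = boto3.client("s3")
# move_object(s3, "mybucket", "incoming/vendor.csv", "processed/vendor.csv")
```

Once the processed file no longer sits under the input prefix, the next run cannot pick it up again, which is what makes the bookmark unnecessary.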

Can a SSE:KMS Key ID be specified when writing to S3 in an AWS Glue Job?

If you follow the AWS Glue Add Job Wizard to create a script to write parquet files to S3 you end up with generated code something like this.
datasink4 = glueContext.write_dynamic_frame.from_options(
frame=dropnullfields3,
connection_type="s3",
connection_options={"path": "s3://my-s3-bucket/datafile.parquet"},
format="parquet",
transformation_ctx="datasink4",
)
Is it possible to specify a KMS key so that the data is encrypted in the bucket?
For a Glue Scala job:
val spark: SparkContext = new SparkContext()
val glueContext: GlueContext = new GlueContext(spark)
spark.hadoopConfiguration.set("fs.s3.enableServerSideEncryption", "true")
spark.hadoopConfiguration.set("fs.s3.serverSideEncryption.kms.keyId", args("ENCRYPTION_KEY"))
I think the syntax will differ for Python, but the idea is the same.
To spell out the answer using PySpark, you can do either
from pyspark.conf import SparkConf
[...]
spark_conf = SparkConf().setAll([
("spark.hadoop.fs.s3.enableServerSideEncryption", "true"),
("spark.hadoop.fs.s3.serverSideEncryption.kms.keyId", "<Your Key ID>")
])
sc = SparkContext(conf=spark_conf)
Note the spark.hadoop prefix. Alternatively (uglier but shorter):
sc._jsc.hadoopConfiguration().set("fs.s3.enableServerSideEncryption", "true")
sc._jsc.hadoopConfiguration().set("fs.s3.serverSideEncryption.kms.keyId", "<Your Key ID>")
where sc is your current SparkContext.
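The two variants above set the same Hadoop properties; the only difference is the spark.hadoop. prefix needed when they go through SparkConf. A minimal sketch of that mapping, with a hypothetical helper name hadoop_to_spark_conf and the key ID left as a placeholder:

```python
def hadoop_to_spark_conf(hadoop_settings):
    """Prefix Hadoop property names so they can be passed to SparkConf().setAll()."""
    return [(f"spark.hadoop.{name}", value) for name, value in hadoop_settings.items()]


hadoop_settings = {
    "fs.s3.enableServerSideEncryption": "true",
    "fs.s3.serverSideEncryption.kms.keyId": "<Your Key ID>",  # placeholder
}

pairs = hadoop_to_spark_conf(hadoop_settings)
for name, value in pairs:
    print(name, "=", value)

# Assumed usage inside a Glue job (not executed here):
# spark_conf = SparkConf().setAll(pairs)
# sc = SparkContext(conf=spark_conf)
```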
This isn't necessary. Perhaps it was when the question was first posed, but the same can be achieved by creating a security configuration and associating it with the Glue job. Just remember to initialize the job in your script, otherwise the security configuration won't be applied:
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
https://docs.aws.amazon.com/glue/latest/dg/encryption-security-configuration.html
https://docs.aws.amazon.com/glue/latest/dg/set-up-encryption.html