Transferring files into S3 buckets based on Glue job status - amazon-web-services

I am new to **AWS Glue**, and my aim is to extract, transform, and load files uploaded to an S3 bucket into an RDS instance. I also need to transfer the files into separate S3 buckets based on the Glue job status (success/failure). More than one file will be uploaded to the initial S3 bucket. How can I get the names of the uploaded files so that I can transfer them to the appropriate buckets?
Step 1: Upload files to S3 bucket1.
Step 2: Trigger a Lambda function to call Job1.
Step 3: On success of Job1, transfer the file to S3 bucket2.
Step 4: On failure, transfer it to another S3 bucket.

Set up a Lambda event trigger that listens to the S3 folder you are uploading the files to. In the Lambda, use the AWS Glue API to run the Glue job (essentially a Python script in AWS Glue).
In the Glue Python script, use an appropriate library, such as pymysql, packaged with your Python script as an external library.
Perform the data load operations from S3 to your RDS tables. If you are using Aurora MySQL, AWS provides a "LOAD DATA FROM S3" feature, so you can load the file directly into the tables (you may need to configure the parameter group and IAM roles).
Lambda script to call glue job:
import boto3

s3 = boto3.client('s3')
glue = boto3.client('glue')

def lambda_handler(event, context):
    gluejobname = "<YOUR GLUE JOB NAME>"
    try:
        # Start the job run, then read back its state right after the start
        runId = glue.start_job_run(JobName=gluejobname)
        status = glue.get_job_run(JobName=gluejobname, RunId=runId['JobRunId'])
        print("Job Status : ", status['JobRun']['JobRunState'])
    except Exception as e:
        print(e)
        raise e
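On the question of how to get the names of the uploaded files: the S3 event that triggers the Lambda lists each uploaded object, so the handler can read the keys from it. The following is a minimal sketch of that extraction, not part of the original answer. The sample event is hand-written, and the job argument name `--source_keys` is a made-up parameter chosen for illustration.

```python
from urllib.parse import unquote_plus


def uploaded_objects(event):
    """Return (bucket, key) pairs for every record in an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Keys arrive URL-encoded (a space becomes '+'), so decode them.
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def job_arguments(pairs):
    """Build Glue job arguments; '--source_keys' is a hypothetical parameter name."""
    return {"--source_keys": ",".join(key for _, key in pairs)}


# Hand-written sample event with a single uploaded object.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "bucket1"},
                "object": {"key": "incoming/my+file.csv"}}}
    ]
}
pairs = uploaded_objects(sample_event)
print(pairs)
print(job_arguments(pairs))
```

Inside the handler, the resulting dictionary could be passed as `Arguments=job_arguments(pairs)` to `glue.start_job_run`, so the job run knows which files it is processing.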
Glue Script:
import mysql.connector
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import *
from pyspark.sql.types import StringType
from pyspark.sql.types import DateType
from pyspark.sql.functions import unix_timestamp, from_unixtime
from pyspark.sql import SQLContext

# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())

url = "<RDS URL>"
uname = "<USER NAME>"
pwd = "<PASSWORD>"
dbase = "DBNAME"

def connect():
    # Open a connection to the RDS instance and return a cursor with it
    conn = mysql.connector.connect(host=url, user=uname, password=pwd, database=dbase)
    cur = conn.cursor()
    return cur, conn

def create_stg_table():
    cur, conn = connect()
    createStgTable1 = "<CREATE STAGING TABLE IF REQUIRED>"
    loadQry = "LOAD DATA FROM S3 PREFIX 'S3://PATH FOR YOUR CSV' REPLACE INTO TABLE <DB.TABLENAME> FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' IGNORE 1 LINES (#var1, #var2, #var3, #var4, #var5, #var6, #var7, #var8) SET ......;"
    cur.execute(createStgTable1)
    cur.execute(loadQry)
    conn.commit()
    conn.close()
You can then create a CloudWatch alert that checks the Glue job status and, depending on the status, performs the file copy operations between S3 buckets. We have a similar setup in production.
Regards
Yuva
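The status-based file move described in this answer could look roughly like the sketch below. It is an illustration under assumptions, not the production setup mentioned in the answer: the bucket names are the placeholders from the question, the list of keys is assumed to be known already, and the S3 client is passed in as a parameter so the logic can be exercised without AWS access. Glue job state-change events carry the final state (for example `SUCCEEDED` or `FAILED`) in `detail.state`.

```python
def target_bucket(job_state, success_bucket, failure_bucket):
    """Pick the destination bucket for a finished Glue job run."""
    return success_bucket if job_state == "SUCCEEDED" else failure_bucket


def move_objects(s3_client, source_bucket, keys, destination_bucket):
    """Copy each key to the destination bucket, then delete the original."""
    for key in keys:
        s3_client.copy_object(
            Bucket=destination_bucket,
            Key=key,
            CopySource={"Bucket": source_bucket, "Key": key},
        )
        s3_client.delete_object(Bucket=source_bucket, Key=key)
```

In a Lambda subscribed to the job state-change event, this might be called as `move_objects(boto3.client("s3"), "bucket1", keys, target_bucket(event["detail"]["state"], "bucket2", "bucket3"))`, where `bucket3` stands in for the unnamed failure bucket.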

Related

loop through multiple tables from source to s3 using glue (Python/Pyspark) through configuration file?

I am looking to ingest multiple tables from a relational database into S3 using Glue. The table details are stored in a JSON configuration file. It would be helpful to have code that loops through the table names and ingests each table into S3. The Glue script is written in Python (PySpark).
This is a sample of what the configuration file looks like:
{"main_key": {
    "source_type": "rdbms",
    "source_schema": "DATABASE",
    "source_table": "DATABASE.Table_1"
}}
This assumes your Glue job can connect to the database and that a Glue Connection has been added to it. Here's a sample extracted from my script that does something similar. You would need to adapt the JDBC URL format to your database (this one uses SQL Server) and supply the implementation details for fetching the config file, looping through items, etc.
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from datetime import datetime

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# hostname, port and db_name must be defined for your database
jdbc_url = f"jdbc:sqlserver://{hostname}:{port};databaseName={db_name}"
connection_details = {
    "user": 'db_user',
    "password": 'db_password',
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}
# Placeholder for your own function that fetches the JSON config from S3 as a dict
tables_config = get_tables_config_from_s3_as_dict()
date_partition = datetime.today().strftime('%Y%m%d')
write_date_partition = f'year={date_partition[0:4]}/month={date_partition[4:6]}/day={date_partition[6:8]}'
# Read each configured table over JDBC and write it to S3 as Parquet
for key, value in tables_config.items():
    table = value['source_table']
    df = spark.read.jdbc(url=jdbc_url, table=table, properties=connection_details)
    write_path = f's3a://bucket-name/{table}/{write_date_partition}'
    df.write.parquet(write_path)
Just write a normal for loop to loop through your DB configuration then follow Spark JDBC documentation to connect to each of them in sequence.
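To make the config-driven loop concrete without a Spark cluster, here is a small sketch that parses a configuration with two entries and computes the table name and S3 write path for each. The second table, the key names `table_1` and `table_2`, and the fixed run date are assumptions added for illustration. In the real job, each planned pair would feed a `spark.read.jdbc(...)` and `df.write.parquet(...)` call.

```python
import json
from datetime import date

# Hypothetical config with two tables, following the structure in the question.
config_text = """
{
  "table_1": {"source_type": "rdbms", "source_schema": "DATABASE",
              "source_table": "DATABASE.Table_1"},
  "table_2": {"source_type": "rdbms", "source_schema": "DATABASE",
              "source_table": "DATABASE.Table_2"}
}
"""


def plan_writes(config, bucket, run_date):
    """Return (table, s3_path) pairs for every rdbms entry in the config."""
    partition = f"year={run_date:%Y}/month={run_date:%m}/day={run_date:%d}"
    plan = []
    for entry in config.values():
        if entry.get("source_type") != "rdbms":
            continue  # skip entries this job does not know how to read
        table = entry["source_table"]
        plan.append((table, f"s3a://{bucket}/{table}/{partition}"))
    return plan


plan = plan_writes(json.loads(config_text), "bucket-name", date(2024, 1, 15))
for table, path in plan:
    print(table, "->", path)
```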

AWS Glue job to unzip a file from S3 and write it back to S3

I'm very new to AWS Glue, and I want to use it to unzip a huge file stored in an S3 bucket and write the contents back to S3.
I couldn't find anything while trying to google this requirement.
My questions are:
How to add a zip file as data source to AWS Glue?
How to write it back to same S3 location?
I am using AWS Glue Studio. Any help will be highly appreciated.
If you are still looking for a solution: you can unzip a file and write it back with an AWS Glue job by using boto3 and Python's zipfile library.
One thing to consider is the size of the zip that you want to process. I've used the following script with a 6 GB (zipped), 30 GB (unzipped) file and it works fine, but it might fail if the file is too large for the worker to buffer.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
import boto3
import io
from zipfile import ZipFile
s3 = boto3.client("s3")
bucket = "wayfair-datasource" # your s3 bucket name
prefix = "files/location/" # the prefix for the objects that you want to unzip
unzip_prefix = "files/unzipped_location/" # the location where you want to store your unzipped files
# Get a list of all the resources in the specified prefix
# ("Contents" is missing from the response when nothing matches, hence .get)
objects = s3.list_objects(
    Bucket=bucket,
    Prefix=prefix
).get("Contents", [])
# The following will get the unzipped files so the job doesn't try to unzip a file that is already unzipped on every run
unzipped_objects = s3.list_objects(
    Bucket=bucket,
    Prefix=unzip_prefix
).get("Contents", [])
# Get a list containing the keys of the objects to unzip
object_keys = [o["Key"] for o in objects if o["Key"].endswith(".zip")]
# Get the keys for the unzipped objects
unzipped_object_keys = [o["Key"] for o in unzipped_objects]
for key in object_keys:
    obj = s3.get_object(
        Bucket=bucket,
        Key=key
    )
    # Read the whole archive into memory (it must fit in the worker's RAM)
    objbuffer = io.BytesIO(obj["Body"].read())
    # using context manager so you don't have to worry about manually closing the file
    with ZipFile(objbuffer) as archive:
        filenames = archive.namelist()
        # iterate over every file inside the zip
        for filename in filenames:
            with archive.open(filename) as file:
                filepath = unzip_prefix + filename
                if filepath not in unzipped_object_keys:
                    s3.upload_fileobj(file, bucket, filepath)
job.commit()
You wrote: "I couldn't find anything while trying to google this requirement."
You couldn't find anything about this because this is not what Glue does. Glue can read gzip (not zip) files natively. If you have zip files, you have to convert them yourself in S3; Glue will not do it.
To convert the files, you can download them, re-pack, and re-upload in gzip format, or any other format that Glue supports.
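The re-packing step can be done with the Python standard library alone. The sketch below converts every member of a zip archive into its own gzip payload, entirely in memory. The function name and the tiny test archive are made up for illustration, and the download from and upload to S3 are left out.

```python
import gzip
import io
from zipfile import ZipFile


def zip_members_to_gzip(zip_bytes):
    """Return {member_name + '.gz': gzip_bytes} for every file in a zip archive."""
    converted = {}
    with ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directories have no content to compress
            converted[info.filename + ".gz"] = gzip.compress(archive.read(info))
    return converted


# Build a tiny zip in memory to exercise the function.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("data.csv", "id,name\n1,a\n")

result = zip_members_to_gzip(buffer.getvalue())
print(sorted(result))
print(gzip.decompress(result["data.csv.gz"]).decode())
```

Like the Glue script above, this holds whole files in memory, so very large members would need a streaming approach instead.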

How to check the cluster status of Redshift using boto3 module in Python?

I need to check if the Redshift cluster is paused or available using Python. I am aware of the module boto3 which further provides a method describe_clusters(). I am not sure how to proceed further to write a Python script for that.
You could try
import boto3
import pandas as pd
DWH_CLUSTER_IDENTIFIER = 'Your clusterId'
KEY='Your AWS Key'
SECRET='Your AWS Secret'
# Get client to connect redshift
redshift = boto3.client('redshift',
                        region_name="us-east-1",
                        aws_access_key_id=KEY,
                        aws_secret_access_key=SECRET
                        )
# Get cluster list as it is in AWS console
myClusters = redshift.describe_clusters()['Clusters']
if len(myClusters) > 0:
    df = pd.DataFrame(myClusters)
    myCluster = df[df.ClusterIdentifier == DWH_CLUSTER_IDENTIFIER.lower()]
    print("My cluster status is: {}".format(myCluster['ClusterAvailabilityStatus'].item()))
else:
    print('No clusters available')
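The same lookup works without pandas. As a hedged sketch: the function below takes the dictionary returned by `describe_clusters()` and returns both `ClusterStatus` (which reports values such as `paused` and `available`) and `ClusterAvailabilityStatus` for one cluster. The function names and the sample response are hand-written for illustration.

```python
def cluster_status(describe_response, cluster_id):
    """Return (ClusterStatus, ClusterAvailabilityStatus), or None if not found."""
    for cluster in describe_response.get("Clusters", []):
        # Redshift stores cluster identifiers in lowercase.
        if cluster["ClusterIdentifier"] == cluster_id.lower():
            return cluster.get("ClusterStatus"), cluster.get("ClusterAvailabilityStatus")
    return None


def is_paused(status):
    """True when the status tuple reports a paused cluster."""
    return status is not None and status[0] == "paused"


# Hand-written sample shaped like a describe_clusters() response.
sample_response = {
    "Clusters": [
        {"ClusterIdentifier": "my-cluster",
         "ClusterStatus": "available",
         "ClusterAvailabilityStatus": "Available"}
    ]
}
status = cluster_status(sample_response, "My-Cluster")
print(status, is_paused(status))
```

With a real client, the response could come from `redshift.describe_clusters(ClusterIdentifier=DWH_CLUSTER_IDENTIFIER)`, which restricts the result to a single cluster.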

ETL from AWS DataLake to RDS

I'm relatively new to data lakes and I'm doing some research for a project on AWS.
I have created a DataLake and have tables generated from Glue Crawlers, I can see the data in S3 and query it using Athena. So far so good.
There is a requirement to transform parts of the data stored in the datalake to RDS for applications to read the data. What is the best solution for ETL from S3 DataLake to RDS?
Most posts I've come across talk about ETL from RDS to S3 and not the other way around.
By creating a Glue Job using the Spark job type I was able to use my S3 table as a data source and an Aurora/MariaDB as the destination.
Trying the same with a python job type didn't allow me to view any S3 tables during the Glue Job Wizard screens.
Once the data is in a Glue DynamicFrame or a Spark DataFrame, writing it out is straightforward: use the RDBMS as the data sink.
For example, to write to a Redshift DB,
// Write data to staging table in Redshift
glueContext.getJDBCSink(
  catalogConnection = "redshift-glue-connections-test",
  options = JsonOptions(Map(
    "database" -> "conndb",
    "dbtable" -> staging,
    "overwrite" -> "true",
    "preactions" -> "<another SQL queries>",
    "postactions" -> "<some SQL queries>"
  )),
  redshiftTmpDir = tempDir,
  transformationContext = "redshift-output"
).writeDynamicFrame(datasetDf)
As shown above, write the data through the JDBC connection you've created.
You can accomplish that with a Glue Job. Sample code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext, SparkConf
from awsglue.context import GlueContext
from awsglue.job import Job
import time
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
file_paths = ['path']
df = glueContext.create_dynamic_frame_from_options("s3", {'paths': file_paths}, format="csv", format_options={"separator": ",", "quoteChar": '"', "withHeader": True})
df.printSchema()
df.show(10)
options = {
    'user': 'usr',
    'password': 'pwd',
    'url': 'url',
    'dbtable': 'tabl',
}
glueContext.write_from_options(frame_or_dfc=df, connection_type="mysql", connection_options=options)
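The `'url'` placeholder in the options above stands for a JDBC URL. As a small sketch of how the options dictionary might be assembled for a MySQL-compatible target, assuming the standard `jdbc:mysql://host:port/database` form (the helper name and all sample values are hypothetical):

```python
def mysql_connection_options(host, port, database, table, user, password):
    """Build a connection_options dict for a MySQL-compatible JDBC target."""
    return {
        "url": f"jdbc:mysql://{host}:{port}/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
    }


# Sample values only; replace them with your own endpoint and credentials.
options = mysql_connection_options(
    "mydb.example.com", 3306, "appdb", "customers", "usr", "pwd"
)
print(options["url"])
```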

AWS Glue: get job_id from within the script using pyspark

I am trying to access the AWS ETL Glue job id from the script of that job. This is the RunID that you can see in the first column in the AWS Glue Console, something like jr_5fc6d4ecf0248150067f2. How do I get it programmatically with pyspark?
As documented in https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-get-resolved-options.html, the run ID is passed in as a command-line argument to the Glue job. You can access JOB_RUN_ID and other default/reserved or custom job parameters using the getResolvedOptions() function.
import sys
from awsglue.utils import getResolvedOptions
args = getResolvedOptions(sys.argv, [])
job_run_id = args['JOB_RUN_ID']
NOTE: JOB_RUN_ID is a default identity parameter, we don't need to include it as part of options (the second argument to getResolvedOptions()) for getting its value during runtime in a Glue Job.
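To see what "passed in as a command-line argument" means in practice, here is a standalone sketch that pulls `--JOB_RUN_ID` out of an argument list with `argparse`. This is not how `getResolvedOptions()` is implemented. It only illustrates the mechanism, and it can be handy for testing a script outside Glue. The sample argument list is made up, reusing the run ID from the question.

```python
import argparse


def read_job_run_id(argv):
    """Return the value of --JOB_RUN_ID from an argv-style list, or None."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--JOB_RUN_ID")
    # parse_known_args ignores the other parameters passed to the script.
    known, _unknown = parser.parse_known_args(argv[1:])
    return known.JOB_RUN_ID


# Made-up argument list in the style of a Glue job invocation.
sample_argv = ["script.py", "--JOB_NAME", "my-job",
               "--JOB_RUN_ID", "jr_5fc6d4ecf0248150067f2"]
print(read_job_run_id(sample_argv))
```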
You can use the boto3 SDK for Python to access AWS services:
import boto3

def lambda_handler(event, context):
    client = boto3.client('glue')
    client.start_crawler(Name='test_crawler')

glue = boto3.client(service_name='glue', region_name='us-east-2',
                    endpoint_url='https://glue.us-east-2.amazonaws.com')
# myJob is assumed to be a job definition fetched earlier (not shown here)
myNewJobRun = glue.start_job_run(JobName=myJob['Name'])
print(myNewJobRun['JobRunId'])