ETL from AWS DataLake to RDS - amazon-web-services

I'm relatively new to data lakes and I'm doing some research for a project on AWS.
I have created a data lake and have tables generated by Glue crawlers; I can see the data in S3 and query it using Athena. So far so good.
There is a requirement to transform parts of the data stored in the data lake and load it into RDS so that applications can read it. What is the best solution for ETL from the S3 data lake to RDS?
Most posts I've come across talk about ETL from RDS to S3 and not the other way around.

By creating a Glue job with the Spark job type I was able to use my S3 table as a data source and an Aurora/MariaDB database as the destination.
Trying the same with the Python shell job type didn't let me select any S3 tables in the Glue job wizard screens.
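For reference, a minimal PySpark sketch of that kind of Spark job, assuming a Glue JDBC Connection for the Aurora/MariaDB instance already exists; the catalog database, table, and connection names below are placeholders:
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the crawled table from the Glue Data Catalog (placeholder names)
source = glueContext.create_dynamic_frame.from_catalog(
    database="my_datalake_db",
    table_name="my_s3_table")

# Apply any transforms here (ApplyMapping, Filter, ...), then write through
# the Glue JDBC Connection that points at the Aurora/MariaDB instance
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=source,
    catalog_connection="aurora-connection",
    connection_options={"dbtable": "target_table", "database": "target_db"})

job.commit()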

Once the data is in a Glue DynamicFrame or Spark DataFrame, writing it out is pretty much straightforward: use the RDBMS as the data sink.
For example, to write to a Redshift DB,
// Write data to staging table in Redshift
glueContext.getJDBCSink(
  catalogConnection = "redshift-glue-connections-test",
  options = JsonOptions(Map(
    "database" -> "conndb",
    "dbtable" -> staging,
    "overwrite" -> "true",
    "preactions" -> "<another SQL queries>",
    "postactions" -> "<some SQL queries>"
  )),
  redshiftTmpDir = tempDir,
  transformationContext = "redshift-output"
).writeDynamicFrame(datasetDf)
As shown above, use the JDBC connection you've created as the target to write the data to.

You can accomplish that with a Glue Job. Sample code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Read the CSV files from S3 into a DynamicFrame
file_paths = ['path']
df = glueContext.create_dynamic_frame_from_options(
    "s3",
    {'paths': file_paths},
    format="csv",
    format_options={"separator": ",", "quoteChar": '"', "withHeader": True})
df.printSchema()
df.show(10)

# JDBC connection details for the target database
options = {
    'user': 'usr',
    'password': 'pwd',
    'url': 'url',
    'dbtable': 'tabl'}

# Write the DynamicFrame to the MySQL-compatible target
glueContext.write_from_options(frame_or_dfc=df, connection_type="mysql", connection_options=options)

Related

AWS Glue Job - create parquet from Data Catalog Table to S3

I'm struggling with what is supposed to be a simple task (maybe I don't understand what's under the hood in the Glue Data Catalog).
I have an S3 bucket with parquet files.
I ran a Glue crawler over it to create a catalogue table so I could query it in Athena, etc.
I'm trying to create a Glue job that retrieves the data from that catalogue table, partitions it, and saves it to another S3 bucket. I receive this exception every time I use the catalogue as the source:
Py4JJavaError: An error occurred while calling o497.pyWriteDynamicFrame.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
...
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 29 in stage 39.0 failed 4 times, most recent failure: Lost task 29.3 in stage 39.0 (TID 437, 172.34.18.186, executor 1): java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
at org.apache.parquet.column.Dictionary.decodeToDouble(Dictionary.java:57)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToDouble(ParquetDictionary.java:46)
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getDouble(OnHeapColumnVector.java:460)
...
However, if I use the initial S3 bucket as the source, it works. I pulled the data from both sources and compared the schemas (they are the same, at least on the surface).
Here is my test Glue job code:
import sys
import pandas as pd
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import pyspark.sql.functions as f
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.types import StructField, StructType, StringType, LongType

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

## Read the data
from_catalogue = glueContext.create_dynamic_frame.from_catalog(
    database="ss-demo-database",
    table_name="tlc_green_data",
    transformation_ctx="datasource0")
from_s3 = glueContext.create_dynamic_frame.from_options(
    format_options={},
    connection_type="s3",
    format="parquet",
    connection_options={
        "paths": ["s3://test/tlc_green_data/green_tripdata_2014-02.parquet"],
        "recurse": False,
    },
    transformation_ctx="from_s3",
)
from_s3.printSchema()
from_catalogue.printSchema()

S3_location = "s3://test/demo/"
# Results are stored back to the S3 bucket (this works if read from S3, doesn't work if read from the catalogue)
datasink = glueContext.write_dynamic_frame_from_options(
    frame=from_catalogue,
    connection_type="s3",
    connection_options={
        "path": S3_location
    },
    format="parquet",
    format_options={"compression": "snappy"},
    transformation_ctx="datasink")
job.commit()
I tried converting to a Spark DataFrame and writing with the overwrite option - same thing.
I tried selecting only 10 rows - same thing.
Your help is greatly appreciated!

loop through multiple tables from source to s3 using glue (Python/Pyspark) through configuration file?

I am looking to ingest multiple tables from a relational database into S3 using Glue. The table details are present in a configuration file, which is a JSON file. It would be helpful to have code that can loop through multiple table names and ingest these tables into S3. The Glue script is written in Python (PySpark).
This is a sample of how the configuration file looks:
{"main_key":{
"source_type": "rdbms",
"source_schema": "DATABASE",
"source_table": "DATABASE.Table_1",
}}
Assuming your Glue job can connect to the database and a Glue Connection has been added to it, here's a sample extracted from my script that does something similar. You would need to update the JDBC URL format so it works for your database (this one uses SQL Server); the implementation details for fetching the config file, looping through items, etc. are left out (a sketch of the config-fetching helper follows the code).
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from datetime import datetime

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

jdbc_url = f"jdbc:sqlserver://{hostname}:{port};databaseName={db_name}"
connection_details = {
    "user": 'db_user',
    "password": 'db_password',
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

tables_config = get_tables_config_from_s3_as_dict()
date_partition = datetime.today().strftime('%Y%m%d')
write_date_partition = f'year={date_partition[0:4]}/month={date_partition[4:6]}/day={date_partition[6:8]}'

for key, value in tables_config.items():
    table = value['source_table']
    df = spark.read.jdbc(url=jdbc_url, table=table, properties=connection_details)
    write_path = f's3a://bucket-name/{table}/{write_date_partition}'
    df.write.parquet(write_path)
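The get_tables_config_from_s3_as_dict helper above is not shown in the original script; a minimal sketch of one way to implement it, assuming the config JSON lives at a placeholder bucket and key, might be:
import json
import boto3

def get_tables_config_from_s3_as_dict():
    # Placeholder bucket/key; the object is the JSON config shown in the question
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-config-bucket", Key="config/tables_config.json")
    return json.loads(obj["Body"].read().decode("utf-8"))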
Just write a normal for loop over your DB configuration, then follow the Spark JDBC documentation to connect to each database in sequence.

AWS Glue ETL to Redshift: DATE

I am using AWS Glue to ETL data to Redshift. I have been encountering an issue where my date is loading as null in Redshift.
What I have set up:
Upload CSV into S3, see sample data:
item | color | price | date
shirt| brown | 25.05 | 03-01-2018
pants| black | 20.99 | 02-14-2017
Crawl S3 object
Create a Redshift table, see schema:
item: string
color: string
price: decimal / numeric
date: date
Script to load data to Redshift, see script:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import to_date, col
from awsglue.dynamicframe import DynamicFrame

glueContext = GlueContext(SparkContext.getOrCreate())

items_dynamicframe = glueContext.create_dynamic_frame.from_catalog(
    database="rdshft-test",
    table_name="items")
items_dynamicframe.printSchema()

# Attempt to get the date loaded correctly into Redshift
data_frame = items_dynamicframe.toDF()
data_frame.show()
# The sample dates are month-day-year, e.g. 03-01-2018
data_frame = data_frame.withColumn("date",
    to_date(col("date"), "MM-dd-yyyy"))
data_frame.show()
Any feedback is appreciated. Thank you.
I was able to resolve this issue by converting back to a DynamicFrame. When pulling my data into the notebook I am using a DynamicFrame, but to convert the string to a date I must use a DataFrame (more specifically, PySpark SQL functions). To load into Redshift, I must convert back to a DynamicFrame. I assume this is a requirement with Glue?
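For reference, a minimal sketch of that conversion and write, assuming a Glue JDBC Connection for the Redshift cluster exists; the connection name, target table, and temp path below are placeholders:
from awsglue.dynamicframe import DynamicFrame

# Convert the transformed DataFrame back to a DynamicFrame
items_dynamicframe = DynamicFrame.fromDF(data_frame, glueContext, "items_dynamicframe")

# Write to Redshift through the existing Glue JDBC Connection
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=items_dynamicframe,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "items", "database": "dev"},
    redshift_tmp_dir="s3://my-temp-bucket/redshift/")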

AWS Glue: how to connect to an external database server which is not located in the AWS cloud?

I need to ingest data from an existing database located on our own network into Redshift using AWS Glue. I can connect to it from an EC2 instance, but I have no idea how to connect to it from AWS Glue. Would someone give me any advice? I think it requires some magic in the VPC settings, but I found no hints after searching Google.
Would someone give me any advice?
AWS wrote dedicated articles about this topic:
How to access and analyze on-premises data stores using AWS Glue
Use AWS Glue to run ETL jobs against non-native JDBC data sources
These would be a good start.
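The common thread in those articles is a Glue Connection created with a VPC, subnet, and security group that can route to the on-premises network (for example over VPN or Direct Connect). A rough sketch of using such a connection from a job script, assuming a connection named onprem-connection and a placeholder source table (the exact keys returned by extract_jdbc_conf can vary by Glue version):
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Pull the JDBC settings out of the Glue Connection; the connection is attached
# to a VPC/subnet/security group that can reach the on-premises network
jdbc_conf = glueContext.extract_jdbc_conf("onprem-connection")

df = (spark.read.format("jdbc")
      .option("url", jdbc_conf["url"])
      .option("user", jdbc_conf["user"])
      .option("password", jdbc_conf["password"])
      .option("dbtable", "schema.source_table")
      .load())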
Upload the JDBC driver to an S3 location, in my case s3://manikantabucket/com.mysql.jdbc_5.1.5.jar, and add the driver jar file in the Glue job properties.
Code:
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *

glueContext = GlueContext(SparkContext.getOrCreate())

connection_mysql8_options_source_emp = {
    "url": "jdbc:mysql://ec2-2-138-264-235.ap-east-1.compute.amazonaws.com:3306/mysql",
    "dbtable": "db",
    "user": "root",
    "password": "root",
    "customJdbcDriverS3Path": "s3://manikantabucket/com.mysql.jdbc_5.1.5.jar",
    "customJdbcDriverClassName": "com.mysql.jdbc.Driver"}

# Read from the external MySQL database using the custom JDBC driver
df_emp = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options=connection_mysql8_options_source_emp)
df_emp = df_emp.coalesce(1)

# Write the result out to S3 as a single CSV file
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
    frame=df_emp,
    connection_type="s3",
    format="csv",
    connection_options={"path": "s3://manikantabucketo/test/", "partitionKeys": []},
    transformation_ctx="S3bucket_node3",
)

Transferring files into S3 buckets based on Glue job status

I am new to AWS Glue, and my aim is to extract, transform and load files uploaded to an S3 bucket into an RDS instance. I also need to transfer the files into separate S3 buckets based on the Glue job status (success/failure). There will be more than one file uploaded into the initial S3 bucket. How can I get the names of the uploaded files so that I can transfer those files to the appropriate buckets?
Step 1: Upload files to S3 bucket1.
Step 2: Trigger Lambda function to call Job1
Step 3: On success of Job1, transfer file to S3 bucket2
Step 4: On failure, transfer to another S3 bucket
Have a Lambda event trigger listening to the S3 folder you are uploading the files to. In the Lambda, use the AWS Glue API to run the Glue job (essentially a Python script in AWS Glue).
In the Glue Python script, use the appropriate library, such as pymysql, packaged with your script as an external library, and perform the data load operations from S3 into your RDS tables. If you are using Aurora MySQL, AWS provides a nice "load from S3" feature, so you can load the file directly into the tables (you may need to do some configuration in the parameter group / IAM roles).
Lambda script to call the Glue job:
import boto3

s3 = boto3.client('s3')
glue = boto3.client('glue')

def lambda_handler(event, context):
    gluejobname = "<YOUR GLUE JOB NAME>"
    try:
        # Start the Glue job and report its current state
        runId = glue.start_job_run(JobName=gluejobname)
        status = glue.get_job_run(JobName=gluejobname, RunId=runId['JobRunId'])
        print("Job Status : ", status['JobRun']['JobRunState'])
    except Exception as e:
        print(e)
        raise e
Glue Script:
import mysql.connector
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import *
from pyspark.sql.types import StringType
from pyspark.sql.types import DateType
from pyspark.sql.functions import unix_timestamp, from_unixtime
from pyspark.sql import SQLContext

# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())

url = "<RDS URL>"
uname = "<USER NAME>"
pwd = "<PASSWORD>"
dbase = "DBNAME"

def connect():
    conn = mysql.connector.connect(host=url, user=uname, password=pwd, database=dbase)
    cur = conn.cursor()
    return cur, conn

def create_stg_table():
    cur, conn = connect()
    createStgTable1 = "<CREATE STAGING TABLE IF REQUIRED>"
    loadQry = "LOAD DATA FROM S3 PREFIX 'S3://PATH FOR YOUR CSV' REPLACE INTO TABLE <DB.TABLENAME> FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' IGNORE 1 LINES (@var1, @var2, @var3, @var4, @var5, @var6, @var7, @var8) SET ......;"
    cur.execute(createStgTable1)
    cur.execute(loadQry)
    conn.commit()
    conn.close()
You can then create a CloudWatch alert that checks the Glue job status and, depending on the status, performs the file copy operations between S3 buckets (a sketch of such a Lambda follows). We have a similar setup in our production environment.
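A rough sketch of that piece, assuming a Lambda triggered by the EventBridge/CloudWatch "Glue Job State Change" event and placeholder bucket names:
import boto3

s3 = boto3.client("s3")

# Placeholder bucket names
SOURCE_BUCKET = "upload-bucket"
SUCCESS_BUCKET = "processed-bucket"
FAILURE_BUCKET = "failed-bucket"

def lambda_handler(event, context):
    # The "Glue Job State Change" event carries the final state in detail.state
    state = event["detail"]["state"]
    target = SUCCESS_BUCKET if state == "SUCCEEDED" else FAILURE_BUCKET
    # Copy every object out of the source bucket, then delete the originals
    for obj in s3.list_objects_v2(Bucket=SOURCE_BUCKET).get("Contents", []):
        key = obj["Key"]
        s3.copy_object(Bucket=target, Key=key,
                       CopySource={"Bucket": SOURCE_BUCKET, "Key": key})
        s3.delete_object(Bucket=SOURCE_BUCKET, Key=key)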
Regards
Yuva