Retrieve RDS credentials from Secrets Manager in AWS Glue Python Script - amazon-web-services

I have a Glue script that tries to read the RDS credentials I have stored in Secrets Manager, but the script keeps running and never completes.
The IAM role the Glue script runs with has the AWS-managed SecretsManagerReadWrite policy attached.
import sys
import json
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrameCollection
from awsglue.dynamicframe import DynamicFrame
import boto3
import botocore
from botocore.errorfactory import ClientError
# import org.apache.spark.sql.functions.concat_ws
from pyspark.sql.types import *
from pyspark.sql.functions import udf
from datetime import date

today = date.today()
current_day = today.strftime("%Y%m%d")

def str_to_arr(my_list):
    s = ""
    for item in my_list:
        if item:
            s += item
    s = s.split(" ")
    return '{"' + ' '.join([elem for elem in s]) + '"}'

str_to_arr_udf = udf(str_to_arr, StringType())

def AddPartitionKeys(glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0]).toDF()
    df = glueContext.add_ingestion_time_columns(df, "day")
    glue_df = DynamicFrame.fromDF(df, glueContext, "transform_date")
    return DynamicFrameCollection({"CustomTransform0": glue_df}, glueContext)

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'days', 's3_bucket', 'rds_endpoint', 'region_name', 'secret_name'])
region_name = args['region_name']
session = boto3.session.Session()
client = session.client("secretsmanager", region_name=region_name)
get_secret_value_response = client.get_secret_value(SecretId=args['secret_name'])
secret = get_secret_value_response['SecretString']
secret = json.loads(secret)
db_username = secret.get('username')
db_password = secret.get('password')
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
print("Below are the creds")
# print("DB USERNAME IS ", db_username)
# print("DB PWD IS ", db_password)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
job.commit()
What am I missing here?
I checked my work against this blog, and yet I am not able to get the script to complete successfully.

After Mark's suggestion, I figured out that I had to create a VPC interface endpoint for Secrets Manager. The steps are outlined here by AWS; I just had to make sure the endpoint policy allowed access to the ARNs of the Secrets Manager resources I wanted to reach.
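For reference, a minimal boto3 sketch of creating such an interface endpoint; the region, VPC, subnet, and security group IDs below are placeholders and should match the network the Glue job runs in:
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
# Hypothetical IDs: use the VPC, subnets, and security group attached to the Glue connection.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.secretsmanager",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,
)
print(response["VpcEndpoint"]["VpcEndpointId"])
A restrictive endpoint policy can then be attached that lists only the secret ARNs the job needs, as mentioned above.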

Related

An error occurred while calling o98.getDynamicFrame. ERROR: column "id" does not exist

I am using Glue to transfer data from one PostgreSQL database to another PostgreSQL database. I always have an issue with the id column because it is declared as the primary key; when the primary-key constraint is removed on the database, there is no error.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node PostgreSQL
PostgreSQL_node1654086903275 = glueContext.create_dynamic_frame.from_catalog(
    database="my_db",
    table_name="test_table_1",
    transformation_ctx="PostgreSQL_node1654086903275",
)
# Script generated for node Rename Field
RenameField_node1654086935942 = RenameField.apply(
    frame=PostgreSQL_node1654086903275,
    old_name="id",
    new_name="Id",
    transformation_ctx="RenameField_node1654086935942",
)
# Script generated for node PostgreSQL
PostgreSQL_node1654086963634 = glueContext.write_dynamic_frame.from_catalog(
    frame=RenameField_node1654086935942,
    database="my_db",
    table_name="test_table_",
    transformation_ctx="PostgreSQL_node1654086963634",
)
job.commit()

AWS Glue - How to output only the latest file in an S3 bucket

I use AWS Glue and Apache Hudi to replicate data in RDS to S3. If I execute the following job, two Parquet files (the initial one and the updated one) are generated in the S3 bucket (basePath). In this case, I want only the latest file and would like to delete the old one.
Does anyone know how to keep only the latest file in the bucket?
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
spark = SparkSession.builder.config('spark.serializer','org.apache.spark.serializer.KryoSerializer').getOrCreate()
sc = spark.sparkContext
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateInserts(5))
df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
df.show()
tableName = 'hudi_mor_athena_sample'
bucketName = 'cm-sato-hudi-sample--datalake'
basePath = f's3://{bucketName}/{tableName}'
hudi_options = {
    'hoodie.table.name': tableName,
    'hoodie.datasource.write.storage.type': 'MERGE_ON_READ',
    'hoodie.compact.inline': 'false',
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.partitionpath.field': 'partitionpath',
    'hoodie.datasource.write.table.name': tableName,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
}
df.write.format("hudi"). \
    options(**hudi_options). \
    mode("overwrite"). \
    save(basePath)
updates = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(dataGen.generateUpdates(3))
df = spark.read.json(spark.sparkContext.parallelize(updates, 2))
df.show()
# update
df.write.format("hudi"). \
    options(**hudi_options). \
    mode("append"). \
    save(basePath)
job.commit()
Instead of mode("append") use mode("overwrite")

AWS Glue NameError: name 'DynamicFrame' is not defined

I'm trying to convert a dataframe to a Dynamic Frame using the toDF and fromDF functions (https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-fromDF) as per the below code snippet:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
## @type: DataSource
## @args: [database = "test-3", table_name = "test", transformation_ctx = "datasource0"]
## @return: datasource0
## @inputs: []
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "test-3", table_name = "test", transformation_ctx = "datasource0")
foo = datasource0.toDF()
bar = DynamicFrame.fromDF(foo, glueContext, "bar")
However, I'm getting an error on the line:
bar = DynamicFrame.fromDF(foo, glueContext, "bar")
The error says:
NameError: name 'DynamicFrame' is not defined
I've tried the usual googling to no avail; I can't see what I've done wrong compared with other examples. Does anyone know why I'm getting this error and how to resolve it?
You need to import the DynamicFrame class from the awsglue.dynamicframe module:
from awsglue.dynamicframe import DynamicFrame
There are a lot of things missing from the examples provided with the AWS Glue ETL documentation.
However, you can refer to the following GitHub repository, which contains many examples of performing basic tasks with Glue ETL:
AWS Glue samples
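For completeness, here is the snippet from the question with the missing import in place (unchanged otherwise):
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame  # the missing import

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="test-3", table_name="test", transformation_ctx="datasource0")
foo = datasource0.toDF()
bar = DynamicFrame.fromDF(foo, glueContext, "bar")  # resolves now that DynamicFrame is imported
job.commit()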

How to connect to Hive installed on an EC2 instance from AWS Glue?

I want to access a Hive metastore by running a Spark job on AWS Glue. Doing so requires me to point at the Hive instance's IP and access it. From my local machine it works, but not from AWS Glue.
I have tried to access Hive using the following piece of code:
spark_session = (
    glueContext.spark_session
    .builder
    .appName('example-pyspark-read-and-write-from-hive')
    .config(
        "hive.metastore.uris",
        "thrift://172.16.12.34:9083",
        conf=SparkConf()
    )
    .enableHiveSupport()
    .getOrCreate()
)
I have also looked at various pieces of documentation, but none could tell me how to connect to an EC2 instance on a specific port.
The code is:
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark import SparkConf, SparkContext
from pyspark.conf import SparkConf
from pyspark.context import SparkConf, SparkContext
from pyspark.sql import (DataFrameReader, DataFrameWriter, HiveContext,
                         SparkSession)
"""
SparkSession ss = SparkSession
.builder()
.appName(" Hive example")
.config("hive.metastore.uris", "thrift://localhost:9083")
.enableHiveSupport()
.getOrCreate();
"""
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark_session = (
    glueContext.spark_session
    .builder
    .appName('example-pyspark-read-and-write-from-hive')
    .config(
        "hive.metastore.uris",
        "thrift://172.16.12.34:9083",
        conf=SparkConf()
    )
    .enableHiveSupport()
    .getOrCreate()
)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = spark_session.createDataFrame(data)
df.write.saveAsTable('example_2')
job.commit()
I expect to get the table written in Hive but instead I get the following error from Glue:
An error occurred while calling o239.saveAsTable. No Route to Host from ip-172-31-14-64/172.31.14.64 to ip-172-31-15-11.ap-south-1.compute.internal:8020 failed on socket timeout exception: java.net.NoRouteToHostException: No route to host;
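The "No Route to Host" error itself points at network connectivity: the Glue job (running in whatever VPC/subnet its connection specifies) cannot reach the EC2 host, so security groups, subnets, and routing need to be checked first. Connectivity aside, a hedged sketch of passing the metastore URI is to set it on the SparkContext configuration before the GlueContext is created, since glueContext.spark_session already exists and a later .builder.config(...).getOrCreate() may simply return the existing session without applying the new settings:
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark import SparkConf
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# Assumed metastore address, reusing the thrift URI from the question.
conf = SparkConf()
conf.set("spark.hadoop.hive.metastore.uris", "thrift://172.16.12.34:9083")
conf.set("spark.sql.catalogImplementation", "hive")

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session

job = Job(glueContext)
job.init(args['JOB_NAME'], args)
df = spark.createDataFrame([('First', 1), ('Second', 2)])
df.write.saveAsTable('example_2')
job.commit()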

How to configure AWS Glue jobs to use column types from the Glue data lake table definition?

Consider the following aws glue job code:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
medicare_dynamicframe = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table")
medicare_dynamicframe.printSchema()
job.commit()
It prints something like this (note that price_key is not in the second position):
root
|-- day_key: string
...
|-- price_key: string
Meanwhile, my_table in the data lake is defined with day_key as int (first column) and price_key as decimal(25,0) (second column).
Maybe I am wrong, but from the sources I gather that AWS Glue uses the table and database only to resolve the S3 path to the data and completely ignores the type definitions. Maybe for some data formats like Parquet that is normal, but not for CSV.
How do I configure AWS Glue to apply the schema from the data lake table definition to a dynamic frame backed by CSV?
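One workaround sketch, assuming the catalog surfaces the CSV columns as strings: cast them explicitly with ApplyMapping, mirroring the table definition described above (column names and target types are taken from the question; whether decimal(25,0) round-trips cleanly from CSV should be verified).
from awsglue.transforms import ApplyMapping

# Cast the string columns read from CSV to the types declared on the catalog table.
medicare_mapped = ApplyMapping.apply(
    frame=medicare_dynamicframe,
    mappings=[
        ("day_key", "string", "day_key", "int"),
        ("price_key", "string", "price_key", "decimal(25,0)"),
    ],
    transformation_ctx="medicare_mapped",
)
medicare_mapped.printSchema()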