AWS Glue with Athena

We are in a phase where we are migrating all of our Spark jobs written in Scala to AWS Glue.
Current flow:
Apache Hive -> Spark (processing/transformation) -> Apache Hive -> BI
Required flow:
AWS S3 (Athena) -> AWS Glue (Spark Scala -> processing/transformation) -> AWS S3 -> Athena -> BI
To be honest, I got this task yesterday and I am doing R&D on it. My questions are:
Can we run the same code in AWS Glue? It has DynamicFrames, which can be converted to DataFrames, but that requires changes in the code.
Can we read data from AWS Athena using the Spark SQL API in AWS Glue, like we normally do in Spark?

AWS Glue extends the capabilities of Apache Spark, so you can generally use your code as it is.
The only changes you need to make are the creation of the session variables and the parsing of the provided arguments. You can run plain old PySpark code without even creating DynamicFrames.
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

def createSession():
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    return sc, glueContext, spark, job

sc, glueContext, spark, job = createSession()

# To handle the arguments
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'arg1', 'arg2'])
arg1 = args['arg1'].split(',')
arg2 = args['arg2'].strip()

# To initialize the job
job.init(args['JOB_NAME'], args)

# your code here

job.commit()
It also supports Spark SQL over the Glue Data Catalog.
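For example, plain Spark SQL can query a catalog table directly. A minimal PySpark sketch, reusing the spark session created in the snippet above, assuming the job is configured to use the Glue Data Catalog as its Hive metastore (e.g. via the --enable-glue-datacatalog parameter mentioned in the next answer) and that a hypothetical table default.testhive exists:
# Query a Glue Data Catalog table with plain Spark SQL (table name is hypothetical)
spark.sql("use default")
spark.sql("select * from testhive").show()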
Hope it helps

I am able to run my current code with minor changes.
I have built a SparkSession and used that session to query a Glue (Hive-enabled) Data Catalog table.
We need to add this parameter to our job: --enable-glue-datacatalog
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SPARK-DEVELOPMENT").getOrCreate()
val sqlContext = spark.sqlContext
sqlContext.sql("use default")
sqlContext.sql("select * from testhive").show()

Related

How to set Spark Config in an AWS Glue job, using Scala Spark?

When running my job, I am getting the following exception:
Exception in User Class: org.apache.spark.SparkException : Job aborted due to stage failure: Task 32 in stage 2.0 failed 4 times, most recent failure: Lost task 32.3 in stage 2.0 (TID 50) (10.100.1.48 executor 8): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read the datetime values as it is.
I have tried to apply the requested configuration value, as follows:
val conf = new SparkConf()
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")
val spark: SparkContext = new SparkContext(conf)
//Get current sparkconf which is set by glue
val glueContext: GlueContext = new GlueContext(spark)
val args = GlueArgParser.getResolvedOptions(
  sysArgs,
  Seq("JOB_NAME").toArray
)
Job.init(args("JOB_NAME"), glueContext, args.asJava)
but the same error occurs. I have also tried setting it to "CORRECTED" via the same approach.
It seems that the config is not properly making its way into the Spark execution. What is the proper way to set Spark config values from a Scala Spark job on Glue?
When you are migrating between versions it is always best to check out the migration guides from AWS. In your case, this can be set in your Glue job properties by passing the properties below as required. To set these, navigate to Glue console -> Jobs -> click on the job -> Job details -> Advanced properties -> Job parameters.
- Key: --conf
- Value: spark.sql.legacy.parquet.int96RebaseModeInRead=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=[CORRECTED|LEGACY]
Please refer to the guide below for more information:
https://docs.aws.amazon.com/glue/latest/dg/migrating-version-30.html#migrating-version-30-from-20
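If you would rather script this than click through the console, the same job parameter can also be set programmatically. A hedged sketch using boto3, assuming a hypothetical existing job named my-glue-job:
import boto3

glue = boto3.client("glue")

# Fetch the existing job definition so the required fields can be echoed back
job = glue.get_job(JobName="my-glue-job")["Job"]
default_args = job.get("DefaultArguments", {})

# Set the rebase-mode settings as a single --conf value
default_args["--conf"] = (
    "spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED"
    " --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED"
    " --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED"
)

glue.update_job(
    JobName="my-glue-job",
    JobUpdate={
        "Role": job["Role"],
        "Command": job["Command"],
        "DefaultArguments": default_args,
    },
)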
This code at the top of my Glue job seems to have done the trick:
val conf = new SparkConf()
//alternatively, use LEGACY if that is required
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
val spark: SparkContext = new SparkContext(conf)
val glueContext: GlueContext = new GlueContext(spark)

AWS Glue Oracle R12 Connection Successful but then timeout

I have a connection from AWS Glue to Oracle R12 and it seems to work fine when I test it in the "connections" section of AWS Glue:
p-*-oracleconnection connected successfully to your instance.
I can crawl all the tables etc. and get the whole schema without a problem.
However, as soon as I try to use these crawled tables in a Glue job I get this:
py4j.protocol.Py4JJavaError: An error occurred while calling o64.getDynamicFrame.
: java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
Connection String (Sanitised obviously)
jdbc:oracle:thin://#xxx.xxx.xxx.xxx:1000:FOOBAR
Loading into DynamicFrame
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'INPUT_DATABASE', 'INPUT_TABLE_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database=args['INPUT_DATABASE'],
    table_name=args['INPUT_TABLE_NAME'],
    transformation_ctx="datasource0",
)
where the Glue job arguments are:
--INPUT_DATABASE p-*-source-database
--INPUT_TABLE_NAME foobar_xx_xx_animals
I have validated that both of these exist in AWS Glue.
Reasons I have to stay with Spark on Glue:
Job bookmarks
Reasons I have to use Glue's built-in connections rather than connecting directly from Spark:
A VPC is needed
I just don't understand why I can crawl all the tables and get all the metadata but as soon as I try to load this into a DynamicFrame it errors out...

Set Spark configuration in AWS Glue PySpark

I am using AWS Glue with PySpark and want to add a couple of configurations to the SparkSession, e.g. "spark.hadoop.fs.s3a.impl" = "org.apache.hadoop.fs.s3a.S3AFileSystem", "spark.hadoop.fs.s3a.multiobjectdelete.enable" = "false", "spark.serializer" = "org.apache.spark.serializer.KryoSerializer", and "spark.hadoop.fs.s3a.fast.upload" = "true". The code I am using to initialise the context is the following:
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
From what I understood from the documentation, I should add these confs as job parameters when submitting the Glue job. Is that the case, or can they also be added when initializing Spark?
This doesn't seem to be erroring out - not sure if it's working
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("spark.hadoop.fs.s3.maxRetries", "20")
hadoop_conf.set("spark.hadoop.fs.s3.consistent.retryPolicyType", "exponential")
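Note that keys set directly on the Hadoop configuration normally drop the spark.hadoop. prefix (e.g. fs.s3.maxRetries rather than spark.hadoop.fs.s3.maxRetries). As for setting the configuration at initialization time, the same pattern as the Scala answer further up can be used in PySpark by building a SparkConf before the SparkContext is created. A minimal sketch, with the property values taken from the question and not verified against every Glue version:
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Build the configuration before any SparkContext exists
conf = SparkConf()
conf.set("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
conf.set("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false")
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.hadoop.fs.s3a.fast.upload", "true")

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session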

Executing a Redshift procedure through AWS Glue

I have created stored procedures on Redshift and need to orchestrate them. The SPs contain the DML statements for SCD creation and are limited to Redshift.
Is there a way on AWS to run the SPs on Redshift through Glue or any other AWS service?
As we do not have triggers on Redshift, I am exploring other options. Help is highly appreciated.
I think you can try making use of preactions/postactions. Preactions/postactions allow you to execute SQL commands before/after your dynamic frame processes data. You can provide a list of semicolon-delimited commands, for example plain SQL commands; you could try calling the procedures using the same approach:
datasink5 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=prod_dynamicframe,
    catalog_connection="my_rdshft",
    connection_options={
        "preactions": "delete from dw.product_dim where sku in ('xxxxx','bbbb');",
        "dbtable": "dw.product_dim",
        "database": "DWBI",
        "postactions": "truncate table ld_stg.ld_product_tbl;"},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="datasink5")
This might also be helpful.
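Following the same approach, the stored procedure itself could be invoked from preactions using Redshift's CALL syntax. A hedged sketch, assuming a hypothetical procedure my_schema.sp_load_scd and reusing the connection and frame from the snippet above:
datasink6 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=prod_dynamicframe,
    catalog_connection="my_rdshft",
    connection_options={
        # Redshift stored procedures are invoked with CALL (procedure name is hypothetical)
        "preactions": "CALL my_schema.sp_load_scd();",
        "dbtable": "dw.product_dim",
        "database": "DWBI"},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="datasink6")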
One approach you can try is preactions and postactions as mentioned by #Eman; I haven't tried it.
But I used psycopg2 to trigger the stored procedure on Redshift.
Just zip the package and pass it to Glue.
Establish a connection
and use the callproc() function to call the stored procedure.
Find its usage at https://www.psycopg.org/docs/usage.html
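A minimal sketch of that approach, assuming hypothetical connection details and a hypothetical procedure my_schema.sp_load_scd (in practice, pull the credentials from Glue job parameters or Secrets Manager); a plain execute() with CALL works as well as callproc():
import psycopg2

# Hypothetical endpoint and credentials -- replace with your own
conn = psycopg2.connect(
    host="my-cluster.xxxxxx.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="********",
)
conn.autocommit = True

with conn.cursor() as cur:
    # Redshift stored procedures are invoked with CALL
    cur.execute("CALL my_schema.sp_load_scd();")

conn.close()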

Referencing a Hive view from within an AWS Glue job

I'm trying to figure out how to migrate a use case from EMR to AWS Glue involving Hive views.
In EMR today, I have Hive external tables backed by Parquet in S3, and I have additional views like create view hive_view as select col from external_table where col = x
Then in Spark on EMR, I can issue statements like df = spark.sql("select * from hive_view") to reference my Hive view.
I am aware I can use the Glue catalog as a drop-in replacement for the Hive metastore, but I'm trying to migrate the Spark job itself off of EMR to Glue. So in my end state, there is no longer a Hive endpoint, only Glue.
Questions:
How do I replace the create view ... statement if I no longer have an EMR cluster to issue Hive commands? What's the equivalent AWS Glue SDK call?
How do I reference those views from within a Glue job?
What I've tried so far: using boto3 to call glue.create_table like this
import boto3

glue = boto3.client('glue')
glue.create_table(
    DatabaseName='glue_db_name',
    TableInput={
        'Name': 'hive_view',
        'TableType': 'VIRTUAL_VIEW',
        'ViewExpandedText': 'select .... from ...'
    }
)
I can see the object created in the Glue catalog but the classification shows as "Unknown" and the references in the job fail with a corresponding error:
py4j.protocol.Py4JJavaError: An error occurred while calling o56.getCatalogSource. :
java.lang.Error: No classification or connection in bill_glue_poc.view_test at ...
I have validated that I can use Hive views with Spark in EMR with the Glue catalog as the metastore -- I see the view in the Glue catalog, and Spark SQL queries succeed, but I cannot reference the view from within a Glue job.
You can create a temporary view in Spark and query it like a Hive table (Scala):
val dataDyf = glueContext.getSourceWithFormat(
  connectionType = "s3",
  format = "parquet",
  options = JsonOptions(Map(
    "paths" -> Array("s3://bucket/external/folder")
  ))
).getDynamicFrame()

// Convert DynamicFrame to Spark's DataFrame and apply filtering
val dataViewDf = dataDyf.toDF().where(...)
dataViewDf.createOrReplaceTempView("hive_view")
val df = spark.sql("select * from hive_view")