Set Spark configuration in AWS Glue PySpark

I am using AWS Glue with PySpark and want to add a couple of configurations to the SparkSession, e.g. "spark.hadoop.fs.s3a.impl" = "org.apache.hadoop.fs.s3a.S3AFileSystem", "spark.hadoop.fs.s3a.multiobjectdelete.enable" = "false", "spark.serializer" = "org.apache.spark.serializer.KryoSerializer", "spark.hadoop.fs.s3a.fast.upload" = "true". The code I am using to initialise the context is the following:
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
From what I understood from the documentation, I should add these confs as job parameters when submitting the Glue job. Is that the case, or can they also be added when initialising the Spark session?

The following doesn't seem to error out, but I'm not sure whether it actually takes effect:
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("spark.hadoop.fs.s3.maxRetries", "20")
hadoop_conf.set("spark.hadoop.fs.s3.consistent.retryPolicyType", "exponential")

Related

How to set Spark Config in an AWS Glue job, using Scala Spark?

When running my job, I am getting the following exception:
Exception in User Class: org.apache.spark.SparkException : Job aborted due to stage failure: Task 32 in stage 2.0 failed 4 times, most recent failure: Lost task 32.3 in stage 2.0 (TID 50) (10.100.1.48 executor 8): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read the datetime values as it is.
I have tried to apply the requested configuration value, as follows:
val conf = new SparkConf()
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")
val spark: SparkContext = new SparkContext(conf)
//Get current sparkconf which is set by glue
val glueContext: GlueContext = new GlueContext(spark)
val args = GlueArgParser.getResolvedOptions(
  sysArgs,
  Seq("JOB_NAME").toArray
)
Job.init(args("JOB_NAME"), glueContext, args.asJava)
but the same error occurs. I have also tried setting it to "CORRECTED" via the same approach.
It seems that the config is not properly making its way into the Spark execution. What is the proper way to set Spark config values from a Scala Spark job on Glue?
When you are migrating between versions, it is always best to check the AWS migration guides. In your case, this can be set in your Glue job properties by passing the properties below as required. To set these, navigate to the Glue console -> Jobs -> click on the job -> Job details -> Advanced properties -> Job parameters.
- Key: --conf
- Value: spark.sql.legacy.parquet.int96RebaseModeInRead=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=[CORRECTED|LEGACY]
Please refer to the guide below for more information:
https://docs.aws.amazon.com/glue/latest/dg/migrating-version-30.html#migrating-version-30-from-20
This code at the top of my Glue job seems to have done the trick:
val conf = new SparkConf()
//alternatively, use LEGACY if that is required
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
val spark: SparkContext = new SparkContext(conf)
val glueContext: GlueContext = new GlueContext(spark)
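If the job is written in PySpark rather than Scala, an equivalent sketch (same settings, context built before the GlueContext; this is an assumption-based translation, not the original poster's code) would be:
from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

conf = SparkConf()
# alternatively, use LEGACY if that is required
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")

sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session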

AWS Glue Oracle R12 Connection Successful but then timeout

I have a connection from AWS Glue to Oracle R12 and it seems to work fine when I test it in the "connections" section of AWS Glue:
p-*-oracleconnection connected successfully to your instance.
I can crawl all the tables etc. and get the whole schema without a problem.
However, as soon as I try to use these crawled tables in a Glue job, I get this:
py4j.protocol.Py4JJavaError: An error occurred while calling o64.getDynamicFrame.
: java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
Connection String (Sanitised obviously)
jdbc:oracle:thin://#xxx.xxx.xxx.xxx:1000:FOOBAR
Loading into DynamicFrame
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'INPUT_DATABASE', 'INPUT_TABLE_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database=args['INPUT_DATABASE'],
    table_name=args['INPUT_TABLE_NAME'],
    transformation_ctx="datasource0",
)
where the Glue job arguments are:
--INPUT_DATABASE p-*-source-database
--INPUT_TABLE_NAME foobar_xx_xx_animals
both of which I have validated and which exist in AWS Glue.
Reasons I have to stay using Spark on Glue:
Job Bookmark
Reasons I have to use Glue's built-in connections rather than connecting directly from Spark:
VPC is needed
I just don't understand why I can crawl all the tables and get all the metadata, but as soon as I try to load them into a DynamicFrame it errors out...

How to configure Spark / Glue to avoid creation of empty $_folder_$ after Glue job successful execution

I have a simple Glue ETL job which is triggered by a Glue workflow. It drops duplicate data from a crawler table and writes the result back into an S3 bucket. The job completes successfully. However, the empty "$folder$" objects that Spark generates remain in S3. They do not look nice in the hierarchy and cause confusion. Is there any way to configure Spark or the Glue context to hide/remove these folders after successful completion of the job?
[S3 console screenshot showing the empty $folder$ objects]
OK, finally after a few days of testing I found the solution. Before pasting the code, let me summarize what I have found:
- Those $folder$ objects are created by Hadoop: Apache Hadoop creates these files when it creates a folder in an S3 bucket (source 1).
- They are actually directory markers, written as path + "/" (source 2).
- To change the behavior, you need to change the Hadoop S3 write configuration in the Spark context. It also helps to read up on the differences between the s3, s3a and s3n filesystem connectors.
- Thanks to stevel's comment on this.
Now the solution is to set the following configuration in the Spark context's Hadoop configuration:
sc = SparkContext()
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
To avoid creation of the _SUCCESS files, you need to set the following configuration as well:
hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
Make sure you use the s3:// URI when writing to the S3 bucket, e.g.:
myDF.write.mode("overwrite").parquet('s3://XXX/YY', partitionBy=['DDD'])

AWS Glue with Athena

We are in a phase where we are migrating all of our Spark jobs written in Scala to AWS Glue.
Current Flow:
Apache Hive -> Spark(Processing/Transformation) -> Apache Hive -> BI
Required Flow:
AWS S3(Athena) -> Aws Glue(Spark Scala -> Processing/Transformation) -> AWS S3 -> Athena -> BI
TBH, I got this task yesterday and I am doing R&D on it. My questions are:
- Can we run the same code in AWS Glue, given that it has DynamicFrames which can be converted to DataFrames but require changes in the code?
- Can we read data from AWS Athena using the Spark SQL API in AWS Glue, like we normally do in Spark?
AWS Glue extends the capabilities of Apache Spark, hence you can generally use your code as it is. The only changes you need to make are to the creation of the session variables and the parsing of the provided arguments. You can run plain old PySpark code without even creating DynamicFrames.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

def createSession():
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    return sc, glueContext, spark, job

# To handle the arguments
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'arg1', 'arg2'])
arg1 = args['arg1'].split(',')
arg2 = args['arg2'].strip()

# To initialize the job
sc, glueContext, spark, job = createSession()
job.init(args['JOB_NAME'], args)

# your code here

job.commit()
It also supports Spark SQL over the Glue Data Catalog.
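For instance, assuming the job is launched with the --enable-glue-datacatalog parameter and that a catalog database and table exist (the names below are placeholders), a short sketch:
# Query Glue Data Catalog tables directly through Spark SQL.
spark.sql("show databases").show()
df = spark.sql("select * from my_db.my_table limit 10")
df.show()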
Hope it helps
I am able to run my current code with minor changes.
I built a SparkSession and used that session to query the Glue (Hive-enabled) catalog tables.
We need to add this parameter to the job: --enable-glue-datacatalog
val spark = SparkSession.builder().appName("SPARK-DEVELOPMENT").getOrCreate()
val sqlContext = spark.sqlContext
sqlContext.sql("use default")
sqlContext.sql("select * from testhive").show()

Referencing a Hive view from within an AWS Glue job

I'm trying to figure out how to migrate a use case from EMR to AWS Glue involving Hive views.
In EMR today, I have Hive external tables backed by Parquet in S3, and I have additional views like create view hive_view as select col from external_table where col = x
Then in Spark on EMR, I can issue statements like df = spark.sql("select * from hive_view") to reference my Hive view.
I am aware I can use the Glue catalog as a drop-in replacement for the Hive metastore, but I'm trying to migrate the Spark job itself off of EMR to Glue. So in my end state, there is no longer a Hive endpoint, only Glue.
Questions:
How do I replace the create view ... statement if I no longer have an EMR cluster to issue Hive commands? What's the equivalent AWS Glue SDK call?
How do I reference those views from within a Glue job?
What I've tried so far: using boto3 to call glue.create_table like this
import boto3

glue = boto3.client('glue')
glue.create_table(
    DatabaseName='glue_db_name',
    TableInput={
        'Name': 'hive_view',
        'TableType': 'VIRTUAL_VIEW',
        'ViewExpandedText': 'select .... from ...'
    }
)
I can see the object created in the Glue catalog but the classification shows as "Unknown" and the references in the job fail with a corresponding error:
py4j.protocol.Py4JJavaError: An error occurred while calling o56.getCatalogSource. :
java.lang.Error: No classification or connection in bill_glue_poc.view_test at ...
I have validated that I can use Hive views with Spark in EMR with the Glue catalog as the metastore -- I see the view in the Glue catalog, and Spark SQL queries succeed, but I cannot reference the view from within a Glue job.
You can create a temporary view in Spark and query it like a Hive table (Scala):
val dataDyf = glueContext.getSourceWithFormat(
  connectionType = "s3",
  format = "parquet",
  options = JsonOptions(Map(
    "paths" -> Array("s3://bucket/external/folder")
  ))
).getDynamicFrame()

// Convert the DynamicFrame to a Spark DataFrame and apply the view's filtering
val dataViewDf = dataDyf.toDF().where(...)
dataViewDf.createOrReplaceTempView("hive_view")
val df = spark.sql("select * from hive_view")
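A PySpark equivalent of the same approach (a sketch; the S3 path and the filter expression are placeholders, and glueContext/spark are assumed to be initialised as in the earlier examples) would be:
# Read the external Parquet data as a DynamicFrame, convert it to a DataFrame,
# apply the view's filter, and register the result as a temporary view.
dataDyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/external/folder"]},
    format="parquet",
)

dataViewDf = dataDyf.toDF().where("col = 'x'")
dataViewDf.createOrReplaceTempView("hive_view")

df = spark.sql("select * from hive_view")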