I have a connection from AWS Glue to Oracle R12 and it seems to work fine when I test it in the "connections" section of AWS Glue:
p-*-oracleconnection connected successfully to your instance.
I can crawl all the tables etc. and get the whole schema without a problem.
However, as soon as I try to use one of these crawled tables in a Glue job, I get this:
py4j.protocol.Py4JJavaError: An error occurred while calling o64.getDynamicFrame.
: java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
Connection string (sanitised, obviously):
jdbc:oracle:thin://#xxx.xxx.xxx.xxx:1000:FOOBAR
Loading into DynamicFrame
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Resolve the job arguments listed below
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'INPUT_DATABASE', 'INPUT_TABLE_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database=args['INPUT_DATABASE'],
    table_name=args['INPUT_TABLE_NAME'],
    transformation_ctx="datasource0",
)
where the Glue job arguments are:
--INPUT_DATABASE p-*-source-database
--INPUT_TABLE_NAME foobar_xx_xx_animals
I have validated that both of these exist in AWS Glue.
Reasons I have to stay with Spark on Glue:
- Job bookmarks
Reasons I have to use Glue's built-in connections rather than connecting directly from Spark:
- a VPC is needed
I just don't understand why I can crawl all the tables and get all the metadata but as soon as I try to load this into a DynamicFrame it errors out...
Related
When running my job, I am getting the following exception:
Exception in User Class: org.apache.spark.SparkException : Job aborted due to stage failure: Task 32 in stage 2.0 failed 4 times, most recent failure: Lost task 32.3 in stage 2.0 (TID 50) (10.100.1.48 executor 8): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read the datetime values as it is.
I have tried to apply the requested configuration value, as follows:
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.{GlueArgParser, Job}
import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.JavaConverters._

val conf = new SparkConf()
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")
val spark: SparkContext = new SparkContext(conf)
// Get the current SparkConf that is set by Glue
val glueContext: GlueContext = new GlueContext(spark)
val args = GlueArgParser.getResolvedOptions(
  sysArgs, // the Array[String] passed to the script's main method
  Seq("JOB_NAME").toArray
)
Job.init(args("JOB_NAME"), glueContext, args.asJava)
but the same error occurs. I have also tried setting it to "CORRECTED" via the same approach.
It seems that the config is not properly making its way into the Spark execution. What is the proper way to set Spark config values from a Scala Spark job on Glue?
When you are migrating between versions, it is always best to check the AWS migration guides. In your case, these settings can be passed in your Glue job properties as job parameters, choosing the values your data requires. To set them, navigate to the Glue console -> Jobs -> click on the job -> Job details -> Advanced properties -> Job parameters:
- Key: --conf
- Value: spark.sql.legacy.parquet.int96RebaseModeInRead=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=[CORRECTED|LEGACY] --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=[CORRECTED|LEGACY]
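For example, choosing CORRECTED for all three keys (use LEGACY instead if the files were written by Spark 2.x or legacy Hive and the values need rebasing):
- Key: --conf
- Value: spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED --conf spark.sql.legacy.parquet.int96RebaseModeInWrite=CORRECTED --conf spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED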
Please refer to the guide below for more information:
https://docs.aws.amazon.com/glue/latest/dg/migrating-version-30.html#migrating-version-30-from-20
This code at the top of my Glue job seems to have done the trick:
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
// alternatively, use LEGACY if that is required
conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "CORRECTED")
conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
val spark: SparkContext = new SparkContext(conf)
val glueContext: GlueContext = new GlueContext(spark)
I have a Glue ETL job (using PySpark) that gives a timeout error, seemingly at random, when trying to access the awsglueml.transforms.FindMatches library. The error given on the Glue dashboard is:
An error occurred while calling z:com.amazonaws.services.glue.ml.FindMatches.apply. The target server failed to respond
Basically, if I run this Glue ETL job late at night, it succeeds most of the time, but if I run it in the middle of the day, it fails with this error. Sometimes just retrying it enough times causes it to succeed, but that doesn't seem like a good solution. It seems like the issue is the AWS FindMatches service not having enough capacity for everyone wanting to use it, but I could be wrong here.
The Glue ETL job was set up using the "A proposed script generated by AWS Glue" option.
The line of code this is timing out on was provided by Glue when I created the job:
from awsglueml.transforms import FindMatches
...
findmatches2 = FindMatches.apply(frame = datasource0, transformId = "<redacted>", computeMatchConfidenceScores = True, transformation_ctx = "findmatches2")
I'd welcome any information on this elusive issue.
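Since retrying does sometimes succeed, one stopgap (not a real fix) is to wrap the call in a simple retry with backoff. A minimal sketch, reusing the same FindMatches.apply parameters shown above; the helper name and retry settings are hypothetical:
import time

from awsglueml.transforms import FindMatches

# Hypothetical helper: retry the FindMatches call a few times with a growing delay.
def find_matches_with_retry(frame, transform_id, ctx, attempts=3, delay_seconds=60):
    last_error = None
    for attempt in range(attempts):
        try:
            return FindMatches.apply(
                frame=frame,
                transformId=transform_id,
                computeMatchConfidenceScores=True,
                transformation_ctx=ctx,
            )
        except Exception as error:  # e.g. "The target server failed to respond"
            last_error = error
            time.sleep(delay_seconds * (attempt + 1))  # simple linear backoff
    raise last_error

findmatches2 = find_matches_with_retry(datasource0, "<redacted>", "findmatches2")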
I am using AWS Glue with PySpark and want to add a couple of configurations to the SparkSession, e.g. "spark.hadoop.fs.s3a.impl" = "org.apache.hadoop.fs.s3a.S3AFileSystem", "spark.hadoop.fs.s3a.multiobjectdelete.enable" = "false", "spark.serializer" = "org.apache.spark.serializer.KryoSerializer", and "spark.hadoop.fs.s3a.fast.upload" = "true". The code I am using to initialise the context is the following:
glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session
From what I understand from the documentation, I should add these configs as job parameters when submitting the Glue job. Is that the case, or can they also be added when initialising the Spark session?
This doesn't seem to error out, but I'm not sure whether it is actually taking effect:
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("spark.hadoop.fs.s3.maxRetries", "20")
hadoop_conf.set("spark.hadoop.fs.s3.consistent.retryPolicyType", "exponential")
We are in a phase where we are migrating all of our Spark jobs written in Scala to AWS Glue.
Current flow:
Apache Hive -> Spark (processing/transformation) -> Apache Hive -> BI
Required flow:
AWS S3 (Athena) -> AWS Glue (Spark Scala, processing/transformation) -> AWS S3 -> Athena -> BI
To be honest, I got this task yesterday and I am still doing R&D on it. My questions are:
- Can we run the same code in AWS Glue? It has DynamicFrames, which can be converted to DataFrames, but that requires changes to the code.
- Can we read data from AWS Athena using the Spark SQL API in AWS Glue, like we normally do in Spark?
AWS Glue extends the capabilities of Apache Spark, so you can generally use your code as it is.
The only changes you need to make are to the creation of the session variables and to the parsing of the provided arguments. You can run plain old PySpark code without even creating DynamicFrames.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

def createSession():
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    return sc, glueContext, spark, job

# To handle the arguments
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'arg1', 'arg2'])
arg1 = args['arg1'].split(',')
arg2 = args['arg2'].strip()

# To initialize the job
sc, glueContext, spark, job = createSession()
job.init(args['JOB_NAME'], args)

# your code here

job.commit()
It also supports Spark SQL over the Glue Data Catalog.
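For example, using the spark session from createSession() above; a sketch assuming the --enable-glue-datacatalog job parameter is set so the Glue Data Catalog acts as the Hive metastore, with placeholder database and table names:
# Query a Glue Data Catalog table as if it were a Hive table
spark.sql("use default")
spark.sql("select * from testhive").show()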
Hope it helps
I am able to run my current code with minor changes.
I built a SparkSession and use that session to query a Glue (Hive-enabled) catalog table.
We need to add this parameter to the job: --enable-glue-datacatalog
import org.apache.spark.sql.SparkSession

val a = SparkSession.builder().appName("SPARK-DEVELOPMENT").getOrCreate()
val sqlContext = a.sqlContext
sqlContext.sql("use default")
sqlContext.sql("select * from testhive").show()
I'm trying to figure out how to migrate a use case from EMR to AWS Glue involving Hive views.
In EMR today, I have Hive external tables backed by Parquet in S3, and I have additional views like create view hive_view as select col from external_table where col = x
Then in Spark on EMR, I can issue statements like df = spark.sql("select * from hive_view") to reference my Hive view.
I am aware I can use the Glue catalog as a drop-in replacement for the Hive metastore, but I'm trying to migrate the Spark job itself off of EMR to Glue. So in my end state, there is no longer a Hive endpoint, only Glue.
Questions:
How do I replace the create view ... statement if I no longer have an EMR cluster to issue Hive commands? What's the equivalent AWS Glue SDK call?
How do I reference those views from within a Glue job?
What I've tried so far: using boto3 to call glue.create_table, like this:
import boto3

glue = boto3.client('glue')
glue.create_table(
    DatabaseName='glue_db_name',
    TableInput={
        'Name': 'hive_view',
        'TableType': 'VIRTUAL_VIEW',
        'ViewExpandedText': 'select .... from ...',
    },
)
I can see the object created in the Glue catalog but the classification shows as "Unknown" and the references in the job fail with a corresponding error:
py4j.protocol.Py4JJavaError: An error occurred while calling o56.getCatalogSource. :
java.lang.Error: No classification or connection in bill_glue_poc.view_test at ...
I have validated that I can use Hive views with Spark in EMR with the Glue catalog as the metastore -- I see the view in the Glue catalog, and Spark SQL queries succeed, but I cannot reference the view from within a Glue job.
You can create a temporary view in Spark and query it like a Hive table (Scala):
val dataDyf = glueContext.getSourceWithFormat(
connectionType = "s3",
format = "parquet",
options = JsonOptions(Map(
"paths" -> Array("s3://bucket/external/folder")
))).getDynamicFrame()
// Convert DynamicFrame to Spark's DataFrame and apply filtering
val dataViewDf = dataDyf.toDF().where(...)
dataViewDf.createOrReplaceTempView("hive_view")
val df = spark.sql("select * from hive_view")
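A PySpark equivalent of the same approach, for reference; the S3 path is a placeholder and the filter mirrors the where col = x view from the question:
# Read the Parquet data from S3 into a DynamicFrame, then expose it as a temp view
data_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://bucket/external/folder"]},
    format="parquet",
)
data_view_df = data_dyf.toDF().where("col = 'x'")
data_view_df.createOrReplaceTempView("hive_view")
df = spark.sql("select * from hive_view")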