Referencing a Hive view from within an AWS Glue job - amazon-web-services

I'm trying to figure out how to migrate a use case from EMR to AWS Glue involving Hive views.
In EMR today, I have Hive external tables backed by Parquet in S3, plus additional views defined like create view hive_view as select col from external_table where col = x.
Then in Spark on EMR, I can issue statements like df = spark.sql("select * from hive_view") to reference my Hive view.
I am aware I can use the Glue catalog as a drop-in replacement for the Hive metastore, but I'm trying to migrate the Spark job itself off of EMR to Glue. So in my end state, there is no longer a Hive endpoint, only Glue.
Questions:
How do I replace the create view ... statement if I no longer have an EMR cluster to issue Hive commands? What's the equivalent AWS Glue SDK call?
How do I reference those views from within a Glue job?
What I've tried so far: using boto3 to call glue.create_table like this:

import boto3

glue = boto3.client('glue')
glue.create_table(
    DatabaseName='glue_db_name',
    TableInput={
        'Name': 'hive_view',
        'TableType': 'VIRTUAL_VIEW',
        'ViewExpandedText': 'select .... from ...'
    }
)
I can see the object created in the Glue catalog but the classification shows as "Unknown" and the references in the job fail with a corresponding error:
py4j.protocol.Py4JJavaError: An error occurred while calling o56.getCatalogSource. :
java.lang.Error: No classification or connection in bill_glue_poc.view_test at ...
I have validated that I can use Hive views with Spark in EMR with the Glue catalog as the metastore -- I see the view in the Glue catalog, and Spark SQL queries succeed, but I cannot reference the view from within a Glue job.

You can create a temporary view in Spark and query it like a Hive table (Scala):
val dataDyf = glueContext.getSourceWithFormat(
  connectionType = "s3",
  format = "parquet",
  options = JsonOptions(Map(
    "paths" -> Array("s3://bucket/external/folder")
  ))
).getDynamicFrame()

// Convert DynamicFrame to Spark's DataFrame and apply filtering
val dataViewDf = dataDyf.toDF().where(...)
dataViewDf.createOrReplaceTempView("hive_view")
val df = spark.sql("select * from hive_view")
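If the Glue job is written in Python instead, the same pattern applies. A minimal PySpark sketch, assuming the external table from the question is already registered in the Glue catalog (database, table, and column names below are the question's placeholders):

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

# Load the external table that backs the view from the Glue Data Catalog
data_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="glue_db_name",       # placeholder from the question
    table_name="external_table"    # Parquet-backed external table
)

# Re-apply the view's filter on a DataFrame and register it as a temp view
data_view_df = data_dyf.toDF().where("col = 'x'")
data_view_df.createOrReplaceTempView("hive_view")

df = spark.sql("select * from hive_view")

The temp view only lives for the duration of the Spark session, so each job run re-creates it; no persistent VIRTUAL_VIEW object in the catalog is involved.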

Related

AWS Glue enableUpdateCatalog not creating new partitions after successful job run

I am having a problem where I have set enableUpdateCatalog=True and also updateBehavior=LOG to update my Glue table, which has 1 partition key. After the job runs, there are no new partitions added to my Glue catalog table, but the data in S3 is separated by the partition key I have used. How do I get the job to automatically partition my Glue catalog table?
Currently I have to manually run boto3 create_partition to create partitions on my Glue catalog table. I want my job to automatically create partitions as it discovers them in the S3 path, separated by partition keys.
Code:
additionalOptions = {
    "enableUpdateCatalog": True,
    "updateBehavior": "LOG"
}
additionalOptions["partitionKeys"] = ["partition_key0", "partition_key1"]

my_df = glueContext.write_dynamic_frame_from_catalog(
    frame=last_transform,
    database=<dst_db_name>,
    table_name=<dst_tbl_name>,
    transformation_ctx="DataSink1",
    additional_options=additionalOptions)
job.commit()
PS: I am currently using PARQUET format
Am I missing any rights that have to be added to my job so that it can create partitions from the job itself?
I got it to work by adding useGlueParquetWriter: 'true' to the catalog table properties, and I have also added
format_options = {
'useGlueParquetWriter': True
}
in the write_dynamic_frame.from_catalog calls.
These steps got it to start working :)
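For reference, a sketch of the question's write call with the catalog-update options combined (placeholder names kept from the question; it assumes the catalog table already carries the useGlueParquetWriter = 'true' property as described above):

additionalOptions = {
    "enableUpdateCatalog": True,
    "updateBehavior": "LOG",
    "partitionKeys": ["partition_key0", "partition_key1"]
}

# The destination table is assumed to have useGlueParquetWriter='true'
# set in its catalog table properties, per the answer above.
sink = glueContext.write_dynamic_frame_from_catalog(
    frame=last_transform,
    database="dst_db_name",      # placeholder
    table_name="dst_tbl_name",   # placeholder
    transformation_ctx="DataSink1",
    additional_options=additionalOptions
)
job.commit()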

Can I write custom query in Google BigQuery Connector for AWS Glue?

I'm creating a Glue ETL job that transfers data from BigQuery to S3. Similar to this example, but with my own dataset.
n.b.: I use BigQuery Connector for AWS Glue v0.22.0-2 (link).
The data in BigQuery is already partitioned by date, and I would like each Glue job run to fetch a specific date only (WHERE date = ...) and group the results into one CSV file output, but I can't find anywhere to insert the custom WHERE query.
In the BigQuery source node configuration options, there is no field for a custom query.
Also, the generated script uses create_dynamic_frame.from_options, which does not accommodate a custom query (per the documentation).
# Script generated for node Google BigQuery Connector 0.22.0 for AWS Glue 3.0
GoogleBigQueryConnector0220forAWSGlue30_node1 = (
    glueContext.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",
        connection_options={
            "parentProject": args["BQ_PROJECT"],
            "table": args["BQ_TABLE"],
            "connectionName": args["BQ_CONNECTION_NAME"],
        },
        transformation_ctx="GoogleBigQueryConnector0220forAWSGlue30_node1",
    )
)
So, is there any way I can write a custom query? Or is there any alternative method?
Quoting this AWS sample project, we can use filter in Connection Options:
filter – Passes the condition to select the rows to convert. If the table is partitioned, the selection is pushed down and only the rows in the specified partition are transferred to AWS Glue. In all other cases, all data is scanned and the filter is applied in AWS Glue Spark processing, but it still helps limit the amount of memory used in total.
Example if used in script:
# Script generated for node Google BigQuery Connector 0.22.0 for AWS Glue 3.0
GoogleBigQueryConnector0220forAWSGlue30_node1 = (
    glueContext.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",
        connection_options={
            "parentProject": "...",
            "table": "...",
            "connectionName": "...",
            "filter": "date = 'yyyy-mm-dd'",  # put the condition here
        },
        transformation_ctx="GoogleBigQueryConnector0220forAWSGlue30_node1",
    )
)

Accessing datacatalog table in Glue properly

I created a table in Athena without a crawler from S3 source. It is showing up in my datacatalog. However, when I try to access it through a python job in Glue ETL, it shows that it has no column or any data. The following error pops up when accessing a column: AttributeError: 'DataFrame' object has no attribute '<COLUMN-NAME>'.
I am trying to access the dynamic frame following the glue way:
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="datacatalog_database",
    table_name="table_name",
    transformation_ctx="datasource"
)
print(f"Count: {datasource.count()}")
print(f"Schema: {datasource.schema()}")
The above logs output: Count: 0 & Schema: StructType([], {}), where the Athena table shows I have around ~800,000 rows.
Sidenotes:
The ETL job concerned has AWSGlueServiceRole attached.
I tried Glue Visual Editor as well, it showed the datacatalog database/table concerned but sadly, same error.
It looks like the S3 bucket has multiple nested folders inside it. For Glue to read these folders, you need to pass additional_options = {"recurse": True} to your from_catalog() call, which makes it read records from the S3 files recursively, as in the sketch below.
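A minimal sketch of the call with that flag added (same database/table names as in the question):

datasource = glueContext.create_dynamic_frame.from_catalog(
    database="datacatalog_database",
    table_name="table_name",
    transformation_ctx="datasource",
    additional_options={"recurse": True}   # read nested S3 folders recursively
)
print(f"Count: {datasource.count()}")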

Load partitioned json files from S3 in AWS Glue ETL jobs

I'm trying to load json files that are partitioned like this in an S3 storage:
|- json-data
   |- x=something
      |- y=something
         |- data.json
I'm loading them like this in my ETL job
datasource0 = glueContext.create_dynamic_frame_from_options('s3',
    {
        'paths': ['s3://bucket/json-data/'],
        'recurse': True,
        'groupFiles': 'inPartition',
        'partitionKeys': ['x', 'y']
    },
    format='json',
    transformation_ctx='datasource0')
However when I try to read the schema using datasource0.printSchema() I don't have any partition in the schema. I need to have those partitions in the schema to do the transformations. After some research I'm not sure if this is a supported feature of create_dynamic_frame_from_options. Does someone know how to do this ?
You can only pass partitionKeys in write_dynamic_frame.from_options, not while reading from S3. For you to load specific partitions or filter on them, those partitions need to already be present in the source table's metadata.
So you need to either crawl the data using a Glue crawler or create the table in Athena with partitions. Once the table is available in the Glue metadata, you can load the table's partitions into Glue ETL as shown below:
glue_context.create_dynamic_frame.from_catalog(
    database="my_S3_data_set",
    table_name="catalog_data_table",
    push_down_predicate=my_partition_predicate)
Please refer to the link below on how you can leverage predicate pushdown:
https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-partitions.html
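For example, with the x=... / y=... layout from the question, the predicate is just a SQL-style expression over the partition columns. A small sketch (database/table names as in the answer above, partition values hypothetical):

# Only partitions matching this expression are loaded from S3
my_partition_predicate = "x = 'something' and y = 'something'"

datasource0 = glue_context.create_dynamic_frame.from_catalog(
    database="my_S3_data_set",
    table_name="catalog_data_table",
    push_down_predicate=my_partition_predicate,
    transformation_ctx="datasource0"
)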

AWS Glue with Athena

We are in a phase where we are migrating all of our Spark jobs, written in Scala, to AWS Glue.
Current Flow:
Apache Hive -> Spark(Processing/Transformation) -> Apache Hive -> BI
Required Flow:
AWS S3(Athena) -> Aws Glue(Spark Scala -> Processing/Transformation) -> AWS S3 -> Athena -> BI
TBH I got this task yesterday and I am doing R&D on it. My questions are:
1. Can we run the same code in AWS Glue? Glue has dynamic frames, which can be converted to dataframes, but that requires changes in the code.
2. Can we read data from AWS Athena using the Spark SQL API in AWS Glue, like we normally do in Spark?
AWS Glue extends the capabilities of Apache Spark, hence you can always use your code as it is.
The only changes you need to make are to the creation of the session variable and the parsing of the provided arguments. You can run plain old PySpark code without even creating dynamic frames.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

def createSession():
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    return sc, glueContext, spark, job

# To handle the arguments
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'arg1', 'arg2'])
arg1 = args['arg1'].split(',')
arg2 = args['arg2'].strip()

# To initialize the job
sc, glueContext, spark, job = createSession()
job.init(args['JOB_NAME'], args)

# your code here

job.commit()
And it also supports Spark SQL over the Glue catalog.
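For example, a minimal sketch, assuming the job was created with the --enable-glue-datacatalog option and that my_database / my_table are hypothetical catalog names:

# spark here is the glueContext.spark_session created above
spark.sql("use my_database")
spark.sql("select * from my_table limit 10").show()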
Hope it helps
I am able to run my current code with minor changes.
I have built a SparkSession and used that session to query the Glue (Hive-enabled) catalog table.
We need to add the --enable-glue-datacatalog parameter to our job.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SPARK-DEVELOPMENT").getOrCreate()
val sqlContext = spark.sqlContext
sqlContext.sql("use default")
sqlContext.sql("select * from testhive").show()