AWS Glue Crawler updating existing catalog tables is (painfully) slow - amazon-web-services

I am continuously receiving and storing multiple feeds of uncompressed JSON objects, partitioned by day, in different locations of an Amazon S3 bucket (hive-style: s3://bucket/object=<object>/year=<year>/month=<month>/day=<day>/object_001.json), and was planning to incrementally batch and load this data into a Parquet data lake using AWS Glue:
Crawlers would update manually created Glue tables, one per object feed, for schema and partition (new files) updates;
Glue ETL jobs with job bookmarking would then periodically batch and map all new partitions per object feed to a Parquet location.
This design pattern and architecture seemed like a fairly safe approach, as it is backed up by several AWS blog posts here and there.
I have a crawler configured as so:
{
    "Name": "my-json-crawler",
    "Targets": {
        "CatalogTargets": [
            {
                "DatabaseName": "my-json-db",
                "Tables": [
                    "some-partitionned-json-in-s3-1",
                    "some-partitionned-json-in-s3-2",
                    ...
                ]
            }
        ]
    },
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG"
    },
    "Configuration": "{\"Version\":1.0,\"Grouping\":{\"TableGroupingPolicy\":\"CombineCompatibleSchemas\"}}"
}
And each table was "manually" initialized like so (Columns is intentionally left empty so that the crawler fills in the schema on its first crawl):
{
    "Name": "some-partitionned-json-in-s3-1",
    "DatabaseName": "my-json-db",
    "StorageDescriptor": {
        "Columns": [],
        "Location": "s3://bucket/object=some-partitionned-json-in-s3-1/"
    },
    "PartitionKeys": [
        {
            "Name": "year",
            "Type": "string"
        },
        {
            "Name": "month",
            "Type": "string"
        },
        {
            "Name": "day",
            "Type": "string"
        }
    ],
    "TableType": "EXTERNAL_TABLE"
}
The first run of the crawler is, as expected, about an hour long, and it successfully figures out the table schema and existing partitions. Yet from that point onward, re-running the crawler takes the exact same amount of time as the first crawl, if not longer; which leads me to believe that the crawler is not only crawling for new files / partitions, but re-crawling the entire S3 locations each time.
Note that the delta of new files between two crawls is very small (only a few new files are expected each time).
The AWS documentation suggests running multiple crawlers, but I am not convinced that this would solve my problem in the long run. I also considered updating the crawler's exclude patterns after each run, but then I would see little advantage in using crawlers over manually updating table partitions through some Lambda boto3 magic.
Am I missing something here? Maybe an option I have misunderstood regarding crawlers updating existing data catalog tables rather than crawling data stores directly?
Any suggestions to improve my data cataloging? Indexing these JSON files in Glue tables is only necessary to me because I want my Glue job to use bookmarking.
Thanks!

AWS Glue crawlers now support Amazon S3 event notifications natively, which solves this exact problem.
See the blog post.
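For reference, a minimal sketch (my addition, not from the original answer) of what enabling event mode can look like when defining a crawler with boto3; the role and SQS queue ARNs are placeholders, and the queue is assumed to already receive the bucket's s3:ObjectCreated:* notifications:

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="my-json-crawler-event-mode",
    Role="arn:aws:iam::123456789012:role/my-glue-crawler-role",  # placeholder
    DatabaseName="my-json-db",
    Targets={
        "S3Targets": [
            {
                "Path": "s3://bucket/object=some-partitionned-json-in-s3-1/",
                # SQS queue subscribed to the bucket's event notifications (placeholder ARN)
                "EventQueueArn": "arn:aws:sqs:eu-west-1:123456789012:my-crawler-events",
            }
        ]
    },
    # Crawl only the objects reported by S3 events instead of re-listing the whole prefix.
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_EVENT_MODE"},
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)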

Still getting some hits on this unanswered question of mine, so I wanted to share a solution I found adequate at the time: I ended up not using crawlers at all to incrementally update my Glue tables.
Using S3 event notifications / S3 API calls via CloudTrail / S3 EventBridge notifications (pick one), I ended up writing a Lambda which issues an ALTER TABLE ADD PARTITION DDL query on Athena, updating an already existing Glue table with the newly created partition, based on the S3 key prefix. This is a pretty straightforward and low-code approach to maintaining Glue tables in my opinion; the only downside is having to handle service throttling (both Lambda and Athena) and failed queries, to avoid any loss of data in the process.
This solution scales up pretty well though, as the number of parallel DDL queries per account is a soft limit that can be increased as your need to update more and more tables grows; and it works well for non-time-critical workflows.
It works even better if you limit S3 writes to your Glue tables' S3 partitions (one file per Glue table partition is ideal in this particular implementation) by batching your data, using a Kinesis Data Firehose delivery stream for example.
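As an illustration only, here is a minimal sketch of the kind of Lambda described above; it assumes the same object=/year=/month=/day= key layout from the question, the database name and Athena output location are placeholders, and error handling, throttling retries and query-state polling are left out:

import boto3
from urllib.parse import unquote_plus

athena = boto3.client("athena")

def lambda_handler(event, context):
    # Triggered by an S3 event notification for a newly created object, e.g.
    # object=some-partitionned-json-in-s3-1/year=2021/month=04/day=22/object_001.json
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = unquote_plus(record["object"]["key"])

    # Parse the hive-style key prefix into partition values.
    parts = dict(p.split("=") for p in key.split("/")[:-1])
    table = parts["object"].replace("-", "_")  # hypothetical mapping from feed name to table name
    location = (f"s3://{bucket}/object={parts['object']}/"
                f"year={parts['year']}/month={parts['month']}/day={parts['day']}/")

    ddl = (f"ALTER TABLE {table} ADD IF NOT EXISTS "
           f"PARTITION (year='{parts['year']}', month='{parts['month']}', day='{parts['day']}') "
           f"LOCATION '{location}'")

    athena.start_query_execution(
        QueryString=ddl,
        QueryExecutionContext={"Database": "my_json_db"},                          # placeholder database
        ResultConfiguration={"OutputLocation": f"s3://{bucket}/athena-results/"},  # placeholder output location
    )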

Related

Can I write custom query in Google BigQuery Connector for AWS Glue?

I'm creating a Glue ETL job that transfers data from BigQuery to S3. Similar to this example, but with my own dataset.
n.b.: I use BigQuery Connector for AWS Glue v0.22.0-2 (link).
The data in BigQuery is already partitioned by date, and I would like every Glue job run to fetch a specific date only (WHERE date = ...) and group the results into one CSV output file. But I can't find any clue as to where to insert the custom WHERE query.
In the BigQuery source node configuration options, only a fixed set of connection fields is available; there is no field for a custom query.
Also, the generated script uses create_dynamic_frame.from_options, which does not accommodate a custom query (per the documentation).
# Script generated for node Google BigQuery Connector 0.22.0 for AWS Glue 3.0
GoogleBigQueryConnector0220forAWSGlue30_node1 = (
    glueContext.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",
        connection_options={
            "parentProject": args["BQ_PROJECT"],
            "table": args["BQ_TABLE"],
            "connectionName": args["BQ_CONNECTION_NAME"],
        },
        transformation_ctx="GoogleBigQueryConnector0220forAWSGlue30_node1",
    )
)
So, is there any way I can write a custom query? Or is there any alternative method?
Quoting this AWS sample project, we can use filter in Connection Options:
filter – Passes the condition to select the rows to convert. If the table is partitioned, the selection is pushed down and only the rows in the specified partition are transferred to AWS Glue. In all other cases, all data is scanned and the filter is applied in AWS Glue Spark processing, but it still helps limit the amount of memory used in total.
Example, when used in the script:
# Script generated for node Google BigQuery Connector 0.22.0 for AWS Glue 3.0
GoogleBigQueryConnector0220forAWSGlue30_node1 = (
    glueContext.create_dynamic_frame.from_options(
        connection_type="marketplace.spark",
        connection_options={
            "parentProject": "...",
            "table": "...",
            "connectionName": "...",
            "filter": "date = 'yyyy-mm-dd'",  # put the condition here
        },
        transformation_ctx="GoogleBigQueryConnector0220forAWSGlue30_node1",
    )
)

AWS Glue Pyspark Parquet write to S3 taking too long

I have an AWS Glue job (PySpark) that needs to load data from a centralized data lake of 350 GB+, prepare it, and load it into an S3 bucket partitioned by two columns. I noticed that it takes a really long time (around a day, even) just to load and write one week of data. There are months of data that need to be written. I tried increasing the number of worker nodes, but that does not seem to fix the problem.
My Glue job currently has 60 G.1X worker nodes.
My SparkConf in the code looks like this:
conf = pyspark.SparkConf().setAll([
    ("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2"),
    ("spark.speculation", "false"),
    ("spark.sql.parquet.enableVectorizedReader", "false"),
    ("spark.sql.parquet.mergeSchema", "true"),
    ("spark.sql.crossJoin.enabled", "true"),
    ("spark.sql.sources.partitionOverwriteMode", "dynamic"),
    ("spark.hadoop.fs.s3.maxRetries", "20"),
    ("spark.hadoop.fs.s3a.multiobjectdelete.enable", "false"),
])
I believe it does succeed in writing the files into partitions; however, it is taking a really long time to move / clean up all the temporary spark-staging files it created. When I checked the tasks, this step seems to take most of the time.
2021-04-22 03:08:50,558 INFO [Thread-6] s3n.S3NativeFileSystem (S3NativeFileSystem.java:rename(1355)): rename s3://<bucket-name>/etl/sessions/.spark-staging-8df58afd-d6b2-4ca0-8611-429125abe2ae/p_date=2020-12-16/geo=u1hm s3://<bucket-name>/etl/sessions/p_date=2020-12-16/geo=u1hm
My write to S3 looks like this:
finalDF.coalesce(50).write.partitionBy('p_date', 'geohash').save(
    "s3://{bucket}/{basepath}/{table}/".format(bucket=args['DataBucket'], basepath='etl', table='sessions'),
    format='parquet', mode="overwrite")
Any help would be appreciated.

Difficulty loading large dataframes from S3 jsonl data into glue for conversion to parquet: Memory Constraints and failed worker spawning

I am attempting to load large datasets from S3 in JSONL format using AWS Glue. The S3 data is accessed through a Glue table projection. Once the datasets are loaded, I save them back to a different S3 location in Parquet format. For the most part this strategy works, but for some of my datasets the Glue job runs out of memory. On closer inspection, it seems to be trying to load the entire dataset onto one executor before redistributing it.
I have tried upgrading the worker size from Standard to G.1X so that each worker would have more memory and CPU, but this did not work for me: the script would still crash.
It was recommended that I try the techniques for partitioning input data outlined on this page:
https://docs.aws.amazon.com/glue/latest/dg/grouping-input-files.html
I tried to use the groupSize and groupFiles parameters (a sketch of how they are typically passed appears at the end of this question), but these did not work for me: the script would still crash.
Finally, it was recommended that I set --conf options as in a typical Spark environment so that my program could use more memory, but as I am working in AWS Glue I am not able to set them.
The following code demonstrates what I'm attempting to do:
datasource = glueContext.create_dynamic_frame.from_catalog(
    database=SOURCE_GLUE_DATABASE,
    table_name=SOURCE_TABLE
)

# yapf: disable
glueContext.write_dynamic_frame.from_options(
    frame=datasource,
    connection_type="s3",
    connection_options={
        "path": OUTPUT_PATH,
    },
    format="parquet"
)
The expected result (and the one I see for most of my datasets) is that Parquet data is written out to the new S3 location. In the worst cases, looking through the logs reveals that only one worker is being used, despite the job's maximum capacity being set to 10 or more. It seems that it just doesn't want to create more workers, and I can't understand why.
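For reference, a minimal sketch (not part of the original question) of how the groupFiles / groupSize options from the linked page are typically passed when reading JSON directly from S3 with from_options; the path and group size are placeholders, and the same keys can alternatively be set as table parameters in the Data Catalog:

datasource = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://my-bucket/jsonl-prefix/"],  # placeholder path
        "recurse": True,
        # Group many small files into larger in-memory groups per partition,
        # roughly 128 MB per group here (placeholder value, in bytes).
        "groupFiles": "inPartition",
        "groupSize": "134217728",
    },
    format="json",
)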

Automate aws Athena partition loading [duplicate]

I have a Spark batch job which is executed hourly. Each run generates and stores new data in S3 with the directory naming pattern DATA/YEAR=?/MONTH=?/DATE=?/datafile.
After uploading the data to S3, I want to investigate it using Athena. Also, I would like to visualize them in QuickSight by connecting to Athena as a data source.
The problem is that after each run of my Spark batch job, the newly generated data stored in S3 will not be discovered by Athena unless I manually run the query MSCK REPAIR TABLE.
Is there a way to make Athena update the data automatically, so that I can create a fully automatic data visualization pipeline?
There are a number of ways to schedule this task. How do you schedule your workflows? Do you use a system like Airflow, Luigi, Azkaban, cron, or AWS Data Pipeline?
From any of these, you should be able to fire off the following CLI command.
$ aws athena start-query-execution --query-string "MSCK REPAIR TABLE some_database.some_table" --result-configuration "OutputLocation=s3://SOMEPLACE"
Another option would be AWS Lambda. You could have a function that calls MSCK REPAIR TABLE some_database.some_table in response to a new upload to S3.
An example Lambda Function could be written as such:
import boto3

def lambda_handler(event, context):
    bucket_name = 'some_bucket'

    client = boto3.client('athena')

    config = {
        'OutputLocation': 's3://' + bucket_name + '/',
        'EncryptionConfiguration': {'EncryptionOption': 'SSE_S3'}
    }

    # Query Execution Parameters
    sql = 'MSCK REPAIR TABLE some_database.some_table'
    context = {'Database': 'some_database'}

    client.start_query_execution(QueryString=sql,
                                 QueryExecutionContext=context,
                                 ResultConfiguration=config)
You would then configure a trigger to execute your Lambda function when new data are added under the DATA/ prefix in your bucket.
Ultimately, explicitly rebuilding the partitions after you run your Spark job using a job scheduler has the advantage of being self-documenting. On the other hand, AWS Lambda is convenient for jobs like this one.
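As an aside, here is a minimal sketch (my addition) of wiring that trigger up with boto3; the bucket name and Lambda ARN are placeholders, and the Lambda must already grant s3.amazonaws.com permission to invoke it:

import boto3

s3 = boto3.client('s3')

# Invoke the Lambda for every new object created under the DATA/ prefix.
s3.put_bucket_notification_configuration(
    Bucket='some_bucket',  # placeholder
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [
            {
                'LambdaFunctionArn': 'arn:aws:lambda:us-east-1:123456789012:function:repair-table',  # placeholder
                'Events': ['s3:ObjectCreated:*'],
                'Filter': {
                    'Key': {
                        'FilterRules': [
                            {'Name': 'prefix', 'Value': 'DATA/'}
                        ]
                    }
                }
            }
        ]
    }
)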
You should be running ADD PARTITION instead:
aws athena start-query-execution --query-string "ALTER TABLE ADD PARTITION..."
which adds the newly created partition from your S3 location.
Athena leverages Hive for partitioning data.
To create a table with partitions, you must define it during the CREATE TABLE statement. Use PARTITIONED BY to define the keys by which to partition data.
There are multiple ways to solve the issue and get the table updated:
1. Call MSCK REPAIR TABLE. This will scan ALL data. It's costly, as every file is read in full (at least it's fully charged by AWS). It's also painfully slow. In short: don't do it!
2. Create partitions on your own by calling ALTER TABLE ADD PARTITION abc .... This is good in the sense that no data is scanned and costs are low. The query is also fast, so no problems here. It's also a good choice if you have a very cluttered file structure without any common pattern (which doesn't seem to be your case, as yours is a nicely organised S3 key pattern). There are also downsides to this approach: A) it's hard to maintain, and B) all partitions have to be stored in the Glue catalog. This can become an issue when you have a lot of partitions, as they all need to be read out and passed to Athena and EMR's Hadoop infrastructure.
3. Use partition projection. There are two different styles you might want to evaluate. Here's the variant which creates the partitions for Hadoop at query time. This means there are no Glue catalog entries sent over the network, and thus large numbers of partitions can be handled more quickly. The downside is that you might 'hit' some partitions that do not exist. These will of course be ignored, but internally all partitions that COULD match your query will be generated, whether they are on S3 or not (so always add partition filters to your query!). If done correctly, this option is a fire-and-forget approach, as no updates are needed.
CREATE EXTERNAL TABLE `mydb`.`mytable`
(
...
)
PARTITIONED BY (
`YEAR` int,
`MONTH` int,
`DATE` int)
...
LOCATION
's3://DATA/'
TBLPROPERTIES(
"projection.enabled" = "true",
"projection.account.type" = "integer",
"projection.account.range" = "1,50",
"projection.YEAR.type" = "integer",
"projection.YEAR.range" = "2020,2025",
"projection.MONTH.type" = "integer",
"projection.MONTH.range" = "1,12",
"projection.DATE.type" = "integer",
"projection.DATE.range" = "1,31",
"storage.location.template" = "s3://DATA/YEAR=${YEAR}/MONTH=${MONTH}/DATE=${DATE}/"
);
https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html
4. Just to list all options: you can also use Glue crawlers. But this doesn't seem to be a favourable approach, as it's not as flexible as advertised.
5. You get more control over Glue by using the Glue Data Catalog API directly, which might be an alternative to approach #2 if you have a lot of automated scripts that do the preparation work to set up your table (a minimal sketch of this follows the summary below).
In short:
If your application is SQL-centric and you like the leanest approach with no scripts, use partition projection
If you have many partitions, use partition projection
If you have only a few partitions, or your partitions do not follow a generic pattern, use approach #2
If you're script-heavy, your scripts do most of the work anyway, and they are easier for you to handle, consider approach #5
If you're confused and have no clue where to start, try partition projection first! It should fit 95% of the use cases.
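A minimal sketch (my addition, with placeholder names) of what approach #5 can look like with boto3, registering a single partition directly in the Glue Data Catalog; the table's storage descriptor is reused so the partition inherits the SerDe and format settings:

import boto3

glue = boto3.client('glue')

database, table = 'some_database', 'some_table'  # placeholders
year, month, date = '2021', '4', '22'            # placeholder partition values

# Copy the table's storage descriptor and point it at the new partition's S3 prefix.
table_sd = glue.get_table(DatabaseName=database, Name=table)['Table']['StorageDescriptor']
partition_sd = dict(table_sd)
partition_sd['Location'] = f's3://SOMEPLACE/DATA/YEAR={year}/MONTH={month}/DATE={date}/'

glue.create_partition(
    DatabaseName=database,
    TableName=table,
    PartitionInput={
        'Values': [year, month, date],  # must match the order of the table's partition keys
        'StorageDescriptor': partition_sd,
    },
)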

How to create AWS Glue table where partitions have different columns? ('HIVE_PARTITION_SCHEMA_MISMATCH')

As per this AWS Forum Thread, does anyone know how to use AWS Glue to create an AWS Athena table whose partitions contain different schemas (in this case, different subsets of columns from the table schema)?
At the moment, when I run the crawler over this data and then query it in Athena, I get the error 'HIVE_PARTITION_SCHEMA_MISMATCH'.
My use case is:
Partitions represent days
Files represent events
Each event is a json blob in a single s3 file
An event contains a subset of columns (dependent on the type of event)
The 'schema' of the entire table is the full set of columns for all the event types (this is correctly put together by Glue crawler)
The 'schema' of each partition is the subset of columns for the event types that occurred on that day (hence in Glue each partition potentially has a different subset of columns from the table schema)
This inconsistency causes the error in Athena, I think
If I were to manually write a schema I could do this fine, as there would just be one table schema, and keys which are missing in the JSON files would be treated as nulls.
Thanks in advance!
I had the same issue, and solved it by configuring the crawler to update table metadata for pre-existing partitions:
It also fixed my issue!
If somebody needs to provision this crawler configuration with Terraform, here is how I did it:
resource "aws_glue_crawler" "crawler-s3-rawdata" {
database_name = "my_glue_database"
name = "my_crawler"
role = "my_iam_role.arn"
configuration = <<EOF
{
"Version": 1.0,
"CrawlerOutput": {
"Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }
}
}
EOF
s3_target {
path = "s3://mybucket"
}
}
This helped me. Posting the image for others in case the link is lost
Despite selecting "Update all new and existing partitions with metadata from the table" in the crawler's configuration, it still occasionally failed to set the expected parameters for all partitions (specifically, jsonPath wasn't inherited from the table's properties in my case).
As suggested in https://docs.aws.amazon.com/athena/latest/ug/updates-and-partitions.html, dropping the partition that is causing the error and recreating it helped.
After dropping the problematic partitions, the Glue crawler re-created them correctly on the following run.