AWS Glue Crawlers and large tables stored in S3 - amazon-web-services

I have a general question about AWS Glue and its crawlers. I have some data streaming into S3 buckets, and I use AWS Athena to access the data as external tables in Redshift.
The tables are partitioned by hour, some glue crawlers update the partitions and the table structure every hour.
The problem is that the crawlers take longer and longer, and someday they will not finish in less than an hour.
Is there some setting in order to speed up this process or some proper alternative to the crawlers in AWS Glue?

Unfortunately, there are no config options for tuning the performance of Glue Crawlers. However, as far as I know, the AWS Glue team plans to release a feature that improves crawler performance significantly (I don't know the date, though).
In general, there are a few ways to register new partitions in the Data Catalog:
1. Run a Glue Crawler
2. Run an MSCK REPAIR TABLE <table> Athena query
3. Add the partition via Athena
4. Add the partition via the Glue API
The most efficient way is to add the partition manually (option 3 or 4). So if you know when and which new partitions should be registered, you can set up a Lambda function to call Athena or the Glue API. The Lambda itself might be triggered by an SNS or CloudWatch event.
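As a rough sketch of option 3, a Lambda handler could build the ALTER TABLE statement for the current hour and submit it to Athena through boto3. The partition keys (dt, hour), the table, database, and bucket names are all hypothetical placeholders and not taken from the question; adjust them to the real layout.

```python
from datetime import datetime, timezone


def hourly_partition_sql(table, base_path, ts):
    """Build an ADD PARTITION statement for the hour that contains ts.

    Assumes a hypothetical Hive-style layout: <base_path>/dt=YYYY-MM-DD/hour=HH/
    """
    dt = ts.strftime("%Y-%m-%d")
    hour = ts.strftime("%H")
    location = f"{base_path}/dt={dt}/hour={hour}/"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{dt}', hour='{hour}') LOCATION '{location}'"
    )


def lambda_handler(event, context):
    # boto3 is imported lazily so the SQL builder can be used without the AWS SDK.
    import boto3

    athena = boto3.client("athena")
    # Placeholder names: replace with the real table, database, and buckets.
    sql = hourly_partition_sql(
        "my_table", "s3://my-bucket/events", datetime.now(timezone.utc)
    )
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )
    return response["QueryExecutionId"]
```

IF NOT EXISTS keeps the statement safe to re-run if the trigger fires twice for the same hour.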

Related

AWS Athena tables automatically appear in AWS Glue console

I recently found out that there is a restriction on the number of partitions that an AWS Athena table may have (20000 at the moment, mentioned here: https://docs.aws.amazon.com/athena/latest/ug/partitions.html).
The same page mentions that AWS Glue tables may have 10 million partitions, so I opened my AWS Glue console to recreate the tables that I had been using in Athena so far, and was surprised to see all the tables that I created in Athena console being listed in AWS Glue console as well.
Hence my question: does that mean every table created in the Athena console is an AWS Glue table and will support 10 million partitions?
I am currently using Athena SDK for Java (https://docs.aws.amazon.com/athena/latest/ug/code-samples.html) to select and load data from table t1 into table t2 using INSERT INTO queries which dynamically generate partitions in Hive format (i.e. col1=<...>/col2=<...>/...). Can I still use it? Is there any other SDK specifically for Glue tables?
My current concern is table t2: it is going to reach the 20000-partition limit quite soon, so I'm wondering whether I still need to worry about that.
And if being listed in the AWS Glue console does not imply support for 10M partitions, how can I make an existing Athena table support 10M partitions? Should the table be created in the AWS Glue console using "Add table" in order to have 10M partition support?
Yes and no. If you are using the Glue Data Catalog to query Athena (by default, you are), then Athena supports querying tables with 10 million partitions. However, it can only actually use 1 million of those partitions at a time (source).

AWS glue job (Pyspark) to AWS glue data catalog

We know that the usual procedure for writing from a PySpark script (AWS Glue job) to the AWS Glue Data Catalog is to write to an S3 bucket (e.g. as CSV), then use a crawler and schedule it.
Is there any other way of writing to the AWS Glue Data Catalog?
I am looking for a direct way to do this, e.g. writing an S3 file and syncing it to the AWS Glue Data Catalog.
You may specify the table manually. The crawler only discovers the schema. If you set the schema manually, you should be able to read your data when you run the AWS Glue job.
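One way to specify the table manually is the Glue CreateTable API. The sketch below assumes comma-delimited CSV files and uses the standard Hive text formats and LazySimpleSerDe; the function names and the column list are hypothetical and only illustrate the shape of the TableInput structure.

```python
def csv_table_input(name, location, columns):
    """Build a Glue TableInput dict for comma-delimited CSV data in S3.

    columns is a list of (column_name, hive_type) tuples.
    """
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "csv"},
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": location,
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }


def create_csv_table(database, name, location, columns):
    # boto3 is imported lazily so csv_table_input can be used without the AWS SDK.
    import boto3

    glue = boto3.client("glue")
    glue.create_table(
        DatabaseName=database,
        TableInput=csv_table_input(name, location, columns),
    )
```

Once the table exists in the catalog, the job can write new files under the same S3 location without running a crawler, as long as the schema stays the same.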
We had this same problem with one of our customers, who had millions of small files in AWS S3. The crawler would practically stall and keep running indefinitely without making progress. We came up with the following alternative approach:
1. A custom Glue Python Shell job was written which used AWS Wrangler to fire queries at AWS Athena.
2. The Python Shell job would list the contents of the folder s3:///event_date=<Put the Date Here from #2.1>
3. The queries fired:
alter table <table> add partition (event_date='<event_date from above>',eventname='List derived from above S3 List output')
4. This was triggered to run after the main ingestion job via Glue Workflows.
If you do not expect the schema to change, create the Glue database and table manually and then use the Glue job directly.
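The Python Shell approach above might look roughly like the following. This is a sketch, not the original job: it assumes the subfolders under each event_date are named eventname=<value>, and the table, database, and base path arguments are placeholders. It relies on awswrangler's s3.list_directories, athena.start_query_execution, and athena.wait_query.

```python
def eventnames_from_prefixes(prefixes):
    """Extract eventname values from S3 prefixes ending in eventname=<value>/."""
    names = []
    for prefix in prefixes:
        last = prefix.rstrip("/").rsplit("/", 1)[-1]
        if last.startswith("eventname="):
            names.append(last[len("eventname="):])
    return names


def add_partitions_sql(table, base_path, event_date, eventnames):
    """Build one ALTER TABLE statement that adds every listed partition."""
    clauses = [
        f"PARTITION (event_date='{event_date}', eventname='{name}') "
        f"LOCATION '{base_path}/event_date={event_date}/eventname={name}/'"
        for name in eventnames
    ]
    return f"ALTER TABLE {table} ADD IF NOT EXISTS " + " ".join(clauses)


def register_partitions(table, database, base_path, event_date):
    # awswrangler is imported lazily so the helpers above work without it.
    import awswrangler as wr

    prefixes = wr.s3.list_directories(f"{base_path}/event_date={event_date}/")
    names = eventnames_from_prefixes(prefixes)
    if not names:
        return None  # nothing to register for this date
    query_id = wr.athena.start_query_execution(
        sql=add_partitions_sql(table, base_path, event_date, names),
        database=database,
    )
    wr.athena.wait_query(query_execution_id=query_id)
    return query_id
```

Listing only the prefixes for one date keeps the work proportional to the new data rather than to the whole table, which is where the crawler was struggling.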

AWS Glue - Tracking Processed Data on DocumentDB

I have a DocumentDB as the data source.
I am running an AWS Glue job that pulls all the data from a certain table and then inserts it into a Redshift cluster.
Is it possible to avoid adding duplicate data?
I have seen that AWS Glue supports bookmarks, but this does not seem to work with DocumentDB as the data source.
Thanks.

How is a crawler better than connecting directly to the DB and retrieving data?

In AWS Glue jobs, there are two approaches to retrieving data from a DB or S3: 1) using a crawler, 2) using a direct connection to the DB or S3.
So my question is: how is a crawler better than connecting directly to a database and retrieving data?
AWS Glue Crawlers do not retrieve the actual data. A crawler accesses your data stores and progresses through a prioritized list of classifiers to extract the schema of your data and other statistics, and then populates the Glue Data Catalog with this metadata. Crawlers can be scheduled to run periodically so that they detect new data as well as changes to existing data, including table definition changes. Crawlers automatically add new tables, new partitions to existing tables, and new versions of table definitions.
The AWS Glue Data Catalog becomes a common metadata repository between Amazon Athena, Amazon Redshift Spectrum, and Amazon S3. AWS Glue Crawlers help in building this metadata repository.

AWS update Athena meta: Glue Crawler vs MSCK Repair Table

When a new partition is added to an Athena table, we can use either a Glue Crawler or MSCK REPAIR TABLE to update the metadata. What do they cost? Which one is preferred?
The MSCK REPAIR TABLE command requires your S3 key to include the partition scheme as documented here. If your S3 key does not include the partition scheme, the MSCK REPAIR TABLE command will report the missing partitions, but you will still have to add them yourself. One other difference is that the MSCK REPAIR TABLE command can time out after 30 minutes (the default Athena query time limit), while a Glue crawler will not.
Here is pricing information:
Glue Crawler:
There is an hourly rate for AWS Glue crawler runtime to discover data and populate the AWS Glue Data Catalog. You are charged an hourly rate based on the number of Data Processing Units (or DPUs) used to run your crawler. A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. You are billed in increments of 1 second, rounded up to the nearest second, with a 10-minute minimum duration for each crawl. Use of AWS Glue crawlers is optional, and you can populate the AWS Glue Data Catalog directly through the API.
Pricing
For all AWS Regions where AWS Glue is available:
$0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run
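To make the 10-minute minimum concrete, here is a small worked calculation using the rate quoted above. The DPU counts and runtimes in the usage note are made-up illustration values, not figures from the question.

```python
def crawler_run_cost(dpus, runtime_seconds, rate_per_dpu_hour=0.44, minimum_seconds=600):
    """Estimate the cost of one crawler run in USD.

    Billing is per second, but each run is charged for at least 10 minutes.
    """
    billed_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * billed_seconds / 3600 * rate_per_dpu_hour
```

For example, crawler_run_cost(2, 300) bills a 5-minute run on 2 DPUs as a full 10 minutes, which comes to roughly $0.15, so a crawler scheduled every hour pays that minimum 24 times a day even when each run is short.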
Athena:
There are no charges for Data Definition Language (DDL) statements like CREATE/ALTER/DROP TABLE, statements for managing partitions, or failed queries.
However, on top of both of these commands you will still incur S3 costs. Reference: AWS Athena: does `msck repair table` incur costs?
My opinion is it is best to manage the partition yourself if you are able to, after adding new data.
ALTER TABLE database.table ADD
PARTITION (partition_name='PartitionValue') LOCATION 's3://bucket/path/partition'
If forced to use Glue or Athena, I would evaluate which way will fit better into your process. The MSCK REPAIR TABLE command may be easier to manage but you may run into trouble if you have a lot of data in partitions or they are not partitioned correctly. Also, you will have to have a way to automate running the command. Glue Crawlers can be configured with triggers.
I agree with adding partitions manually. You can do this via an Athena query (ALTER TABLE ... ADD PARTITION () ...) as in the answer from @KiteCoder, or you can do this via the Glue API directly.
Calling the Glue API is more verbose, but also more 'structured'. Calling Athena is obviously a SQL query, and I know how many people despise writing code that dynamically generates SQL.
The specific operation is CreatePartition. It does require an object called StorageDescriptor which defines all the columns and data types in that table, but for an existing table you can retrieve that structure from the GetTable operation.
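A minimal sketch of that GetTable then CreatePartition flow with boto3 follows. The helper names are hypothetical; the partition values must be given in the same order as the table's partition keys, and the location is whatever S3 prefix holds the new partition's data.

```python
import copy


def partition_input_from_table(table, values, location):
    """Build a PartitionInput by reusing the table's StorageDescriptor.

    table is the "Table" dict returned by the Glue GetTable operation.
    """
    descriptor = copy.deepcopy(table["StorageDescriptor"])
    descriptor["Location"] = location  # point the partition at its own S3 prefix
    return {"Values": list(values), "StorageDescriptor": descriptor}


def create_partition(database, table_name, values, location):
    # boto3 is imported lazily so the helper above can be used without the AWS SDK.
    import boto3

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    glue.create_partition(
        DatabaseName=database,
        TableName=table_name,
        PartitionInput=partition_input_from_table(table, values, location),
    )
```

Copying the descriptor from the table keeps the partition's columns and SerDe settings in sync with the table definition, so only the location differs per partition.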