When a new partition is added to an Athena table, we can use either a Glue crawler or MSCK REPAIR TABLE to update the metadata. What does each one cost? Which one is preferred?
The MSCK REPAIR TABLE command requires your S3 keys to include the partition scheme, as documented here. If your S3 keys do not include the partition scheme, the command will report the missing partitions but will not add them, so you will still have to add them yourself. Another difference is that MSCK REPAIR TABLE can time out after 30 minutes (the default Athena query timeout), while a Glue crawler will not.
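For illustration (these bucket and key names are made up), a Hive-style layout that MSCK REPAIR TABLE can register looks like s3://my-bucket/my-table/year=2023/month=01/day=15/data.json, whereas a layout like s3://my-bucket/my-table/2023/01/15/data.json will only be reported as missing; those partitions would have to be added explicitly with ALTER TABLE ... ADD PARTITION.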
Here is pricing information:
Glue Crawler:
There is an hourly rate for AWS Glue crawler runtime to discover data and populate the AWS Glue Data Catalog. You are charged an hourly rate based on the number of Data Processing Units (or DPUs) used to run your crawler. A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. You are billed in increments of 1 second, rounded up to the nearest second, with a 10-minute minimum duration for each crawl. Use of AWS Glue crawlers is optional, and you can populate the AWS Glue Data Catalog directly through the API.
Pricing
For all AWS Regions where AWS Glue is available:
$0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run
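As a rough worked example (assuming a small crawl that uses a single DPU and finishes within the 10-minute minimum): $0.44 per DPU-hour x (10/60) hour x 1 DPU ≈ $0.073 per crawler run. Crawls that use more DPUs or run longer cost proportionally more.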
Athena:
There are no charges for Data Definition Language (DDL) statements like CREATE/ALTER/DROP TABLE, statements for managing partitions, or failed queries.
However, with either approach you will still incur S3 request costs on top of the above. Reference: AWS Athena: does `msck repair table` incur costs?
In my opinion, it is best to manage the partitions yourself if you are able to, adding each one right after you add new data:
ALTER TABLE database.table ADD
  PARTITION (partition_name='PartitionValue') LOCATION 's3://bucket/path/partition';
If forced to use Glue or Athena, I would evaluate which one fits better into your process. The MSCK REPAIR TABLE command may be easier to manage, but you may run into trouble if you have a lot of data in the partitions or if they are not partitioned correctly. You will also need a way to automate running the command, whereas Glue crawlers can be configured with triggers.
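For example, here is a minimal sketch of automating MSCK REPAIR TABLE with boto3 (the database, table, and output location names are placeholders):

import boto3

athena = boto3.client("athena")

def repair_table():
    # DDL statements are free in Athena, but the query still needs an
    # S3 output location for its result metadata.
    response = athena.start_query_execution(
        QueryString="MSCK REPAIR TABLE my_table",
        QueryExecutionContext={"Database": "my_database"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/ddl/"},
    )
    return response["QueryExecutionId"]

The same call can run an ALTER TABLE ... ADD PARTITION statement instead if you know exactly which partition was added.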
I agree with adding partitions manually. You can do this via an Athena query (ALTER TABLE ... ADD PARTITION () ...) as in the answer from #KiteCoder, or you can do this via the Glue API directly.
Calling the Glue API is more verbose, but also more 'structured'. Calling Athena is obviously a SQL query, and I know how many people despise writing code that dynamically generates SQL.
The specific operation is CreatePartition. It does require an object called StorageDescriptor which defines all the columns and data types in that table, but for an existing table you can retrieve that structure from the GetTable operation.
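A minimal sketch with boto3, assuming the table already exists in the Data Catalog (the database, table, and S3 path are placeholders):

import boto3

glue = boto3.client("glue")

database = "my_database"
table = "my_table"
partition_value = "2023-01-15"  # value for the single partition key
partition_location = f"s3://my-bucket/my-table/dt={partition_value}/"

# Reuse the table's StorageDescriptor so the partition inherits the
# columns, SerDe, and input/output formats already defined on the table.
table_def = glue.get_table(DatabaseName=database, Name=table)["Table"]
storage_descriptor = dict(table_def["StorageDescriptor"])
storage_descriptor["Location"] = partition_location

glue.create_partition(
    DatabaseName=database,
    TableName=table,
    PartitionInput={
        "Values": [partition_value],  # one value per partition key, in order
        "StorageDescriptor": storage_descriptor,
    },
)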
Related
I recently found out that there is a restriction on the number of partitions an AWS Athena table may have (20,000 at the moment, mentioned here: https://docs.aws.amazon.com/athena/latest/ug/partitions.html).
The same page mentions that AWS Glue tables may have 10 million partitions, so I opened the AWS Glue console to recreate the tables I had been using in Athena so far, and was surprised to see all the tables I had created in the Athena console listed in the AWS Glue console as well.
Hence a question, does that mean every table created in Athena console is going to be an AWS Glue table and is going to support 10 million partitions?
I am currently using Athena SDK for Java (https://docs.aws.amazon.com/athena/latest/ug/code-samples.html) to select and load data from table t1 into table t2 using INSERT INTO queries which dynamically generate partitions in Hive format (i.e. col1=<...>/col2=<...>/...). Can I still use it? Is there any other SDK specifically for Glue tables?
My current concern is table t2: it is going to reach the 20,000-partition limit quite soon, so I'm wondering whether I still need to worry about that or not.
And in case being listed in the AWS Glue console does not by itself imply support for 10 million partitions, how do I make an existing Athena table support 10 million partitions? Should the table be created in the AWS Glue console using "Add table" in order to have 10-million-partition support?
Yes and no. If you are using the Glue Data Catalog with Athena (by default, you are), then Athena supports querying tables with 10 million partitions. However, a single query can only actually use 1 million of those partitions at a time. source
I have a few AWS Glue crawlers set up to crawl CSVs in S3 to populate my tables in Athena.
My scenario and question:
I replace the .csv files in S3 daily with updated versions. Do I have to run the existing crawlers again, perhaps on a schedule, to update the tables in Athena with the latest content? Or is the crawler only required to run if the schema changes, such as additional columns being added? I just want to ensure that my tables in Athena always output all of the data in the updated CSVs; I rarely make schema changes to the table structures. If the crawlers only need to run when actual structure changes take place, I would prefer to run them a lot less frequently.
When a Glue crawler runs, the following actions take place:
Classifies the data to determine the format, schema, and associated properties of the raw data
Groups the data into tables or partitions
Writes metadata to the Data Catalog
Athena references the schema of the tables in the Data Catalog and reads the data directly from the specified S3 data source at query time, so replaced file contents are picked up automatically. If the schema remains constant and no new partitions are added, the crawler runs can be scheduled much less frequently.
You can also refer to the documentation here to understand how Glue crawlers and CSV files work with Athena: https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html
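If you do keep a schedule, here is a minimal sketch of adjusting it with boto3 (the crawler name and cron expression are placeholders):

import boto3

glue = boto3.client("glue")

# Run the existing crawler weekly instead of daily, since the schema rarely changes.
glue.update_crawler(
    Name="my_csv_crawler",
    Schedule="cron(0 3 ? * MON *)",  # Mondays at 03:00 UTC
)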
I have a couple of tables in my S3 bucket. The tables are big, both in size and in the number of files; they are stored as JSON (suboptimal, I know) and have a lot of partitions.
Now I want to enable AWS Glue Data Catalog and AWS Glue Crawlers, however I am terrified by the price of the crawlers going through all of the data.
The schema doesn't change often so it is not necessary to go through all of the files on S3.
Will the Crawlers go through all the files by default? Is it possible to configure a smarter sampling strategy that would look inside just some of the files instead of all of them?
Depending on your bucket structure, you may be able to use exclude patterns and point the crawlers only at the specific prefixes you want crawled. If the partitioning is Hive-style, you can use Athena to execute MSCK REPAIR TABLE to add the partitions. Alternatively, you can create the tables manually in Athena and then run MSCK REPAIR TABLE, which is bound to take a very long time if you have too many partitions and the files are huge, as you mentioned.
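A minimal sketch of a crawler scoped to one prefix with exclude patterns (the names, IAM role, and patterns are placeholders):

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="my_scoped_crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="my_database",
    Targets={
        "S3Targets": [
            {
                # Crawl only this prefix...
                "Path": "s3://my-bucket/my-table/",
                # ...and skip objects you don't need inspected.
                "Exclusions": ["archive/**", "**/_tmp/**"],
            }
        ]
    },
)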
I have a general question about AWS Glue and its crawlers. I have some data streaming into S3 buckets, and I use AWS Athena to access it as external tables in Redshift.
The tables are partitioned by hour, and some Glue crawlers update the partitions and the table structure every hour.
The problem is that the crawlers take longer and longer, and some day they will not finish in less than an hour.
Is there some setting in order to speed up this process or some proper alternative to the crawlers in AWS Glue?
Unfortunately there are no configuration options for Glue crawlers to tune performance. However, as far as I know the AWS Glue team is expected to release a feature that improves crawler performance significantly (I don't know the date, though).
In general, there are few ways to register new partitions in Data Catalog:
Run a Glue Crawler
Run MSCK REPAIR TABLE <table> Athena query
Add partition via Athena
Add partition via Glue API
The most efficient way is to add partitions manually (options 3 or 4). So if you know when and which new partitions should be registered, you can set up a Lambda function to call Athena or the Glue API. The Lambda itself might be triggered by an SNS notification or a CloudWatch event.
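A minimal sketch of such a Lambda, assuming it receives standard S3 ObjectCreated notifications and the data uses a single Hive-style partition key named dt (all names, the database, and the output location are placeholders):

import boto3

athena = boto3.client("athena")

def handler(event, context):
    # Each record corresponds to a new object landing in S3.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]  # e.g. my-table/dt=2023-01-15/part-0000.json

        # Extract the Hive-style partition value from the key.
        dt = next(p.split("=", 1)[1] for p in key.split("/") if p.startswith("dt="))

        athena.start_query_execution(
            QueryString=(
                f"ALTER TABLE my_table ADD IF NOT EXISTS "
                f"PARTITION (dt='{dt}') LOCATION 's3://{bucket}/my-table/dt={dt}/'"
            ),
            QueryExecutionContext={"Database": "my_database"},
            ResultConfiguration={"OutputLocation": "s3://my-athena-results/ddl/"},
        )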
I see there are tons of examples and documentation for copying data from DynamoDB to Redshift, but we are looking at an incremental copy process where only the new rows are copied from DynamoDB to Redshift. We will run this copy process every day, so there is no need to reload the entire Redshift table each day. Does anybody have any experience or thoughts on this topic?
DynamoDB has a feature (currently in preview) called Streams:
Amazon DynamoDB Streams maintains a time ordered sequence of item level changes in any DynamoDB table in a log for a duration of 24 hours. Using the Streams APIs, developers can query the updates, receive the item level data before and after the changes, and use it to build creative extensions to their applications built on top of DynamoDB.
This feature will allow you to process new updates as they come in and do what you want with them, rather than design an exporting system on top of DynamoDB.
You can see more information about how the processing works in the Reading and Processing DynamoDB Streams documentation.
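As a rough illustration only (the stream wiring and names below are assumptions, not part of the answer above), a consumer receiving records in the standard DynamoDB Streams shape, for example via an AWS Lambda trigger, might look like this:

def handler(event, context):
    # Process incremental changes delivered by DynamoDB Streams.
    for record in event["Records"]:
        action = record["eventName"]                     # INSERT, MODIFY, or REMOVE
        keys = record["dynamodb"]["Keys"]                # primary key of the changed item
        new_image = record["dynamodb"].get("NewImage")   # item after the change, if present

        if action in ("INSERT", "MODIFY") and new_image:
            # Stage the changed row (e.g. write it to S3) so a later
            # Redshift COPY can load only the incremental data.
            print(f"{action}: {keys} -> {new_image}")
        elif action == "REMOVE":
            print(f"REMOVE: {keys}")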
The Redshift COPY command from DynamoDB can only copy the entire table. There are several ways to achieve an incremental copy:
Using an AWS EMR cluster and Hive - if you set up an EMR cluster, you can use Hive tables to execute queries on the DynamoDB data and move the results to S3. From there the data can easily be loaded into Redshift.
You can store your DynamoDB data based on access patterns (see http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns). If you store the data this way, the DynamoDB tables can be dropped after they are copied to Redshift.
This can be solved with a secondary DynamoDB table that tracks only the keys that were changed since the last backup. This table has to be updated whenever the initial DynamoDB table is updated (add, update, delete). At the end of the backup process you delete the tracked keys, either all at once or one by one as each row is backed up.
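A rough sketch of that idea with boto3 (the table names and key attribute are hypothetical):

import boto3

dynamodb = boto3.resource("dynamodb")
main_table = dynamodb.Table("orders")                  # the table being backed up
changed_keys = dynamodb.Table("orders_changed_keys")   # keys touched since the last backup

def write_item(item):
    # Every write to the main table also records the key in the tracking table.
    main_table.put_item(Item=item)
    changed_keys.put_item(Item={"order_id": item["order_id"]})

def incremental_backup(copy_row_to_staging):
    # Copy only the rows whose keys were tracked, then clear the tracking entries.
    # (Scan pagination omitted for brevity.)
    for entry in changed_keys.scan()["Items"]:
        row = main_table.get_item(Key={"order_id": entry["order_id"]}).get("Item")
        if row is not None:
            copy_row_to_staging(row)  # e.g. write to S3 for a later Redshift COPY
        changed_keys.delete_item(Key={"order_id": entry["order_id"]})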
If your DynamoDB table has either
a timestamp attribute, or
a binary flag attribute that conveys data freshness,
then you can write a Hive query to export only the current day's (or otherwise fresh) data to S3, and then copy that incremental S3 data to Redshift using the 'KEEP_EXISTING' option.