Automatic partition addition to Hive metastore in a Presto setup

Source: S3
Query Engine: Athena
In our use case, new partitions (and hence files) are added to S3 continuously, and the partitions become queryable immediately via dynamic partition projection. So the cost of adding partitions is not high.
Now I am porting this to GCP: GCS and Presto with a Hive metastore (the Trino distribution, to be exact). (I haven't tried Dataproc yet; both Presto and the Hive metastore will be self-managed.)
I am looking for something close to dynamic partition projection, the way it works in Athena. However, I haven't found anything automatic: it is either adding partitions manually or repairing the entire table via the procedure system.sync_partition_metadata.
Is there anything less expensive in Presto that would let me add partitions to the metastore without requiring a full scan?
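For what it's worth, Trino's Hive connector exposes two procedures that are cheaper than a full repair: system.sync_partition_metadata with the 'ADD' mode (adds missing partitions without dropping anything), and system.register_partition, which registers one known partition with no scan at all (if I recall correctly, it must be enabled with hive.allow-register-partition-procedure=true in the catalog config). A minimal sketch using the trino Python client; host, catalog, schema, and table names are placeholders:

```python
# Sketch: register a single new partition in the Hive metastore from Trino,
# avoiding the full-table scan of system.sync_partition_metadata(..., 'FULL').
# Host, catalog, schema, and table names below are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino-coordinator.example.com",  # placeholder
    port=8080,
    user="etl",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()

# Register one partition whose files were just written to GCS.
cur.execute("""
    CALL system.register_partition(
        schema_name => 'default',
        table_name => 'events',
        partition_columns => ARRAY['dt'],
        partition_values => ARRAY['2021-06-01']
    )
""")
cur.fetchall()  # drive the call to completion
```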

Related

AWS Athena tables automatically appear in AWS Glue console

I recently found out that there is a restriction on the number of partitions an AWS Athena table may have (20,000 at the moment, mentioned here: https://docs.aws.amazon.com/athena/latest/ug/partitions.html).
The same page mentions that AWS Glue tables may have 10 million partitions, so I opened my AWS Glue console to recreate the tables I had been using in Athena, and was surprised to see all the tables I had created in the Athena console listed in the AWS Glue console as well.
Hence the question: does that mean every table created in the Athena console is an AWS Glue table, and will it support 10 million partitions?
I am currently using the Athena SDK for Java (https://docs.aws.amazon.com/athena/latest/ug/code-samples.html) to select and load data from table t1 into table t2 using INSERT INTO queries, which dynamically generate partitions in Hive format (i.e. col1=<...>/col2=<...>/...). Can I still use it? Is there any other SDK specifically for Glue tables?
My current concern is table t2: it is going to reach the 20,000-partition limit quite soon, so I'm wondering whether I still need to worry about that or not.
And if being listed in the AWS Glue console does not by itself imply support for 10 million partitions, how do I make an existing Athena table support 10 million partitions? Should the table be created in the AWS Glue console using "Add table" in order to get 10-million-partition support?
Yes and no. If you are using the Glue Data Catalog with Athena (by default, you are), then Athena supports querying tables with 10 million partitions. However, it can only actually use 1 million of those partitions at a time. source

What are some good strategies/applications for viewing/analyzing metrics I have stored in an S3 bucket?

I have an S3 bucket full of plaintext metrics and want a way to analyze and view this data. One option I am considering is Amazon Athena, but I would like to weigh the pros and cons of a few approaches.
Amazon Athena is really good for ad hoc analysis. If your files are in a format Athena supports and you only want to run a few ad hoc queries, you can get started with Athena quickly.
If you want to make your ad hoc analysis faster, create an external table over your existing files, then consider running an Athena CTAS query to transform your data to Avro/Parquet and partition/bucket it as necessary.
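For instance, a CTAS of that shape might look like the following sketch (database, table, bucket, and column names are invented):

```python
# Sketch: convert a raw table to partitioned Parquet with an Athena CTAS query.
# Database, table, bucket, and column names are placeholders.
import boto3

athena = boto3.client("athena")

ctas = """
CREATE TABLE metrics_db.metrics_parquet
WITH (
    format = 'PARQUET',
    external_location = 's3://my-analytics-bucket/metrics_parquet/',
    partitioned_by = ARRAY['dt']
) AS
SELECT metric_name, metric_value, dt   -- partition column must come last
FROM metrics_db.metrics_raw
"""

athena.start_query_execution(
    QueryString=ctas,
    ResultConfiguration={"OutputLocation": "s3://my-analytics-bucket/athena-results/"},
)
```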
If cost is not an issue, you can also look into Redshift. See if it's possible to run the Redshift COPY command on your files to import them into a Redshift database, then use appropriate sort keys and distribution keys to improve your query performance in Redshift.
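A hedged sketch of such a COPY, driven from Python with psycopg2 (cluster endpoint, credentials, IAM role, and all names are placeholders):

```python
# Sketch: load S3 files into Redshift with COPY. The endpoint, credentials,
# IAM role ARN, table, and bucket below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="analytics",
    user="admin",
    password="...",  # placeholder
)
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY metrics
        FROM 's3://my-analytics-bucket/metrics/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS JSON 'auto'
    """)
```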

Add location dynamically in Amazon Redshift create table statement

I am trying to create an external table in Amazon Redshift using the statement mentioned at this link.
In my case I want the location to be parameterized instead of a static value.
I am using DBeaver with Amazon Redshift.
If your partitions are Hive-compatible (<partition_column_name>=<partition_column_value>) and your table is defined via Glue or Athena, then you can run MSCK REPAIR TABLE on the Athena table directly, which will add them. Read this thread for more info: https://forums.aws.amazon.com/thread.jspa?messageID=800945
You can also try partition projection if you don't use Hive-compatible partitions; it lets you define how the file locations relate to the partition columns and parameters.
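As an illustration, enabling projection on an existing table might look like this sketch (the table name, partition column, date range, and S3 template are assumptions, not values from the question):

```python
# Sketch: enable Athena partition projection on an existing table.
# Table name, partition column, date range, and S3 template are placeholders.
import boto3

athena = boto3.client("athena")

ddl = """
ALTER TABLE mydb.events SET TBLPROPERTIES (
    'projection.enabled' = 'true',
    'projection.dt.type' = 'date',
    'projection.dt.range' = '2020-01-01,NOW',
    'projection.dt.format' = 'yyyy-MM-dd',
    'storage.location.template' = 's3://my-bucket/events/${dt}/'
)
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```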
If those don't work for you, you can use AWS Glue crawlers, which supposedly detect partitions automatically: https://docs.aws.amazon.com/glue/latest/dg/add-crawler.html
If that doesn't work either, then your problem is very specific. I suggest rolling up your sleeves and writing some code, deployed on Lambda or as an AWS Glue Python Shell job. Here are some examples where other people have tried that:
https://medium.com/swlh/add-newly-created-partitions-programmatically-into-aws-athena-schema-d773722a228e
https://medium.com/@alsmola/partitioning-cloudtrail-logs-in-athena-29add93ee070
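In that spirit, here is a minimal, untested sketch of such a Lambda handler; it assumes Hive-style keys like events/dt=2021-06-01/... and made-up database/table names:

```python
# Sketch of a Lambda handler that adds a partition when a new object lands in S3.
# Assumes Hive-style keys like s3://bucket/events/dt=2021-06-01/file.json;
# the database, table, and output location are placeholders.
import boto3

athena = boto3.client("athena")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]   # e.g. events/dt=2021-06-01/part-0.json
        prefix, _, _filename = key.rpartition("/")
        dt = prefix.split("dt=")[-1]          # naive parse of the partition value

        ddl = (
            "ALTER TABLE mydb.events ADD IF NOT EXISTS "
            f"PARTITION (dt = '{dt}') "
            f"LOCATION 's3://{bucket}/{prefix}/'"
        )
        athena.start_query_execution(
            QueryString=ddl,
            ResultConfiguration={"OutputLocation": f"s3://{bucket}/athena-results/"},
        )
```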

Smart sampling with AWS Glue Crawlers

I have a couple of tables in my S3 bucket. The tables are big both in storage size and in number of files; they are stored in JSON (suboptimal, I know) and have a lot of partitions.
Now I want to enable the AWS Glue Data Catalog and AWS Glue crawlers, but I am terrified by the cost of the crawlers going through all of the data.
The schema doesn't change often, so it is not necessary to go through all of the files on S3.
Will the crawlers go through all the files by default? Is it possible to configure a smarter sampling strategy that looks inside just some of the files instead of all of them?
Depending on your bucket structure, you may be able to use exclude patterns and point the crawlers only at the specific prefixes you want crawled (see the sketch below). If the partitioning is Hive-style, you can use Athena to run MSCK REPAIR TABLE to add the partitions. Alternatively, you can create the tables manually in Athena and run MSCK REPAIR TABLE, which is bound to take a very long time if you have too many partitions and the files are huge, as you mentioned.
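As a rough sketch of the scoping idea, here is a crawler definition via boto3; the S3 target's SampleSize field limits how many files the crawler reads per leaf folder, which addresses the sampling concern (all names, paths, and the role ARN are placeholders):

```python
# Sketch: a Glue crawler scoped to one prefix, with exclusions and per-folder
# file sampling so it does not read every object. Names/ARNs are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="metrics-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="metrics_db",
    Targets={
        "S3Targets": [
            {
                "Path": "s3://my-bucket/metrics/",
                "Exclusions": ["tmp/**", "**/_SUCCESS"],
                "SampleSize": 10,  # read at most 10 files per leaf folder
            }
        ]
    },
)
```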

AWS update Athena meta: Glue Crawler vs MSCK Repair Table

When a new partition is added to an Athena table, we can use either a Glue crawler or MSCK REPAIR TABLE to update the metadata. What does each cost? Which one is preferred?
The MSCK REPAIR TABLE command requires your S3 keys to include the partition scheme, as documented here. If your S3 keys do not include the partition scheme, MSCK REPAIR TABLE will report the missing partitions, but you will still have to add them yourself. One other difference is that MSCK REPAIR TABLE can time out after 30 minutes (the default Athena query time limit), while a Glue crawler will not.
Here is pricing information:
Glue Crawler:
There is an hourly rate for AWS Glue crawler runtime to discover data and populate the AWS Glue Data Catalog. You are charged an hourly rate based on the number of Data Processing Units (or DPUs) used to run your crawler. A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. You are billed in increments of 1 second, rounded up to the nearest second, with a 10-minute minimum duration for each crawl. Use of AWS Glue crawlers is optional, and you can populate the AWS Glue Data Catalog directly through the API.
Pricing
For all AWS Regions where AWS Glue is available:
$0.44 per DPU-Hour, billed per second, with a 10-minute minimum per crawler run
Athena:
There are no charges for Data Definition Language (DDL) statements like CREATE/ALTER/DROP TABLE, statements for managing partitions, or failed queries.
However, on top of both of these commands you will still incur S3 costs. Reference: AWS Athena: does `msck repair table` incur costs?
My opinion is that it is best to manage the partitions yourself if you are able to, adding each one after its new data arrives:
```sql
ALTER TABLE database.table ADD
  PARTITION (partition_name = 'PartitionValue')
  LOCATION 's3://bucket/path/partition';
```
If forced to use Glue or Athena, I would evaluate which way fits better into your process. The MSCK REPAIR TABLE command may be easier to manage, but you may run into trouble if you have a lot of data in the partitions or if they are not partitioned correctly. Also, you will have to have a way to automate running the command; Glue crawlers can be configured with triggers.
I agree with adding the partitions manually. You can do this via an Athena query (ALTER TABLE ... ADD PARTITION ...) as in the answer from @KiteCoder, or you can do it via the Glue API directly.
Calling the Glue API is more verbose, but also more 'structured'. Calling Athena is obviously a SQL query, and I know how many people despise writing code that dynamically generates SQL.
The specific operation is CreatePartition. It does require an object called StorageDescriptor, which defines all the columns and data types in the table, but for an existing table you can retrieve that structure from the GetTable operation.
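A sketch of that flow with boto3 (database, table, partition column, and value are placeholders):

```python
# Sketch: clone the table's StorageDescriptor and register one new partition
# via the Glue API. Database, table, and partition value are placeholders.
import boto3
import copy

glue = boto3.client("glue")
DB, TABLE, DT = "mydb", "events", "2021-06-01"

# 1. Fetch the table definition to reuse its StorageDescriptor.
table = glue.get_table(DatabaseName=DB, Name=TABLE)["Table"]
sd = copy.deepcopy(table["StorageDescriptor"])

# 2. Point the descriptor at the new partition's location
#    (assumes the table location ends with '/').
sd["Location"] = f"{table['StorageDescriptor']['Location']}dt={DT}/"

# 3. Create the partition, with one value per partition key, in key order.
glue.create_partition(
    DatabaseName=DB,
    TableName=TABLE,
    PartitionInput={
        "Values": [DT],
        "StorageDescriptor": sd,
    },
)
```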