Retention and archival policy on Hive data - amazon-web-services

We have an AWS EMR cluster that includes Hive, with its metastore backed by Aurora and its data stored in S3. Programs create the database(s) and tables in Hive and populate them with data.
After a while, these databases are no longer needed (say after 1 year), and we want to delete them automatically after a set period. The usual way is a cron job that runs every month or so, finds the databases older than 1 year from an internal metadata table, and programmatically fires the Hive queries that drop them. But this has some drawbacks; for example, manually created tables are not covered.
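For reference, a minimal sketch of the kind of cleanup query such a cron job ends up firing (the database name is illustrative); note that for external tables this removes only the metadata, while the S3 data stays in place:
-- Illustrative: drop an expired database and all tables inside it.
DROP DATABASE IF EXISTS old_project_db CASCADE;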
Is there any built-in Hive feature that does this?

Hive is essentially just a metadata store that defines how data should be interpreted; it does not manage any of the underlying data. (This is a major difference between Hive and a conventional database, and it is why Hive can use multiple file backends, such as HDFS and S3, in the same Hive instance.)
I'm going to guess you are using an S3 bucket for your data, so you likely want to look into expiring objects with an S3 lifecycle rule. This will do exactly what you want: delete data after a period of time, without disrupting Hive.
If you are using partitions you may wish to do some additional cleanup.
MSCK REPAIR TABLE will help maintain the partitions in Hive, but it is really slow against S3 and can periodically time out. YMMV.
It's better to drop partitions:
ALTER TABLE bills DROP IF EXISTS PARTITION (mydate='2022-02') PURGE;

In Hive you can implement partition retention (since Hive 3.1.0).
For example, to drop partitions and their data after 7 days:
ALTER TABLE employees SET TBLPROPERTIES ('partition.retention.period'='7d');

There is no built-in Hive tool that removes databases according to a retention period.
You have been doing this for a while so you are likely well aware of the risks of deleting metadata older than a year.
There are several ways to define retention on data, but none that I'm aware of that removes metadata.
Things you could look at:
You could add a trigger to Aurora to delete tables directly from the Hive metastore. (Hive tables have values for their create time and last access time.) You could create some logic to work at that level.
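As a hedged sketch, assuming the standard Hive metastore schema in Aurora (the TBLS and DBS tables, with CREATE_TIME stored as a Unix epoch), a query to find candidate tables older than a year might look like this:
-- Illustrative only: list tables created more than a year ago in the metastore.
SELECT d.NAME AS db_name, t.TBL_NAME, FROM_UNIXTIME(t.CREATE_TIME) AS created
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
WHERE t.CREATE_TIME < UNIX_TIMESTAMP(NOW() - INTERVAL 1 YEAR);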

Related

Querying Latest Available Partition in Athena

I am building an ETL pipeline using primarily state machines, Athena, S3, and the Glue catalog. In general things work in the following way:
A table, partitioned by "version", exists in the Glue Catalog. The table represents the output destination of some ETL process.
A step function (managed by some other process) executes "INSERT INTO" athena queries. The step function supplies a "version" that is used as part of the "INSERT INTO" query so that new data can be appended into the table defined in (1). The table contains all "versions" - it's a historical table that grows over time.
My question is: What is a good way of exposing a view/table that allows someone (or something) to query only the latest "version" partition for a given historically partitioned table?
I've looked into other table types AWS offers, including Governed tables and Iceberg tables. Each seems to have some incompatibility with our existing or planned future architecture:
Governed tables do not support writes via Athena INSERT queries. Only Glue ETL/Spark seems to be supported at the moment.
Iceberg tables do not support Lake Formation data filters (which we'd like to use in the future to control data access).
Iceberg tables also seem to have poor performance. Anecdotally, it can take several seconds to insert a very small handful of rows into a given Iceberg table. I'd worry about future performance when we want to insert a million rows.
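To make the setup above concrete, a hedged sketch of the kind of versioned INSERT INTO the step function might issue (the table names, columns, and version value are all illustrative; in Athena the partition column goes last in the SELECT list):
-- Illustrative: append a new "version" partition to the historical table.
INSERT INTO my_output_table
SELECT col_a, col_b, 'v42' AS version
FROM my_staging_table;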

AWS Glue crawler needs to create one table from many files with identical schemas

We have a very large number of folders and files in S3, all under one particular folder, and we want to crawl for all the CSV files, and then query them from one table in Athena. The CSV files all have the same schema. The problem is that the crawler is generating a table for every file, instead of one table. Crawler configurations have a checkbox option to "Create a single schema for each S3 path" but this doesn't seem to do anything.
Is what I need possible? Thanks.
Glue crawlers claim to solve many problems, but in fact solve few. If you're slightly outside the scope of what they were designed for, you're out of luck. There might be a way to configure one to do what you want, but in my experience, trying to make Glue crawlers do things that aren't perfectly aligned with what they were designed for is not worth the effort.
It sounds like you have a good idea of what the schema of your data is. When that is the case, Glue crawlers also provide very little value. You probably have a better idea of what the schema should look like than Glue will ever be able to figure out.
I suggest that you create the table manually and write a one-off script that lists all the partition locations on S3 that you want to include in the table, then generates ALTER TABLE ADD PARTITION … SQL or Glue API calls to add those partitions to the table.
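A hedged sketch of what the generated statements might look like (the table name, bucket, and partition values are illustrative):
-- Illustrative: register two partition locations in a single statement.
ALTER TABLE my_table ADD IF NOT EXISTS
  PARTITION (dt = '2020-02-03') LOCATION 's3://my-bucket/data/dt=2020-02-03/'
  PARTITION (dt = '2020-02-04') LOCATION 's3://my-bucket/data/dt=2020-02-04/';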
To keep the table up to date when new partition locations are added, have a look at this answer for guidance: https://stackoverflow.com/a/56439429/1109
One way to do what you want is to use just one of the tables created by the crawler as an example, and create a similar table manually (in AWS Glue -> Tables -> Add tables, or in Athena itself) with:
CREATE EXTERNAL TABLE `tablename`(
  `column1` string,
  `column2` string,
  ...)
LOCATION 's3://...'
Using an existing table as an example, you can see the query used to create that table in Athena: go to Database -> select your database from the Glue Data Catalog, click the three dots in front of the automatically-created crawler table that you chose as an example, and click the "Generate Create table DDL" option. It will generate a big query for you; modify it as necessary (I believe you mostly need to look at the LOCATION and TBLPROPERTIES parts).
When you run this modified query in Athena, a new table will appear in the Glue Data Catalog, but it will not have any information about your S3 files and partitions, and the crawler most likely will not update the metastore info for you. So in Athena you can run the "MSCK REPAIR TABLE tablename;" query (it's not very efficient, but it works for me), and it will add the missing partition information. In the Results tab you will see something like this (if you use partitions on S3, of course):
Partitions not in metastore: tablename:dt=2020-02-03 tablename:dt=2020-02-04
Repair: Added partition to metastore tablename:dt=2020-02-03
Repair: Added partition to metastore tablename:dt=2020-02-04
After that you should be able to run your Athena queries.

Unloading & reloading data between S3 and Redshift with schema changes

I'm interested in setting up some automated jobs that will periodically export data from our Redshift instance and store it on S3, where ideally it will then be bubbled back up into Redshift via an external table running in Redshift Spectrum. One thing I'm not sure how best to deal with is that certain tables I'm working with change schema over time.
I'm able to both UNLOAD data from Redshift to S3 without a problem, and I'm also able to set up an external table within Redshift and have that S3 data available for querying. However, I'm not sure how to best deal with cases where our tables will change columns over time. For example, in the case of certain event data we capture through Segment, traits that get added will result in a new column on the Redshift table that won't have existed in previous UNLOADs. In Redshift, the column value for data that came in before the column existed will just result in NULL values.
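For context, a hedged sketch of the kind of UNLOAD described above (the table, bucket, and IAM role are illustrative):
-- Illustrative: export a Redshift table to S3 as Parquet for use by Spectrum.
UNLOAD ('SELECT * FROM events')
TO 's3://my-bucket/events/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-unload-role'
FORMAT AS PARQUET;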
What is the best way to deal with this gradual change in data structure over time? If I just add the new fields to our external table, will Redshift be able to deal with the fact that these fields don't necessarily exist in the older UNLOADs, or do I need to go some other route?

AWS Glue: Do I really need a Crawler for new content?

What I understand from the AWS Glue docs is that a crawler will help crawl and discover new data. However, I noticed that once I have crawled once, if new data goes into S3, the data is already discovered when I query the Data Catalog from Athena, for example. So, can I say that I do not need a crawler to crawl every time new data is added, unless there are new schemas?
In fact, if I know the schema of the files, I can just manually create the table and do without a crawler, am I correct?
If data is partitioned by some keys (placed in sub-folders, like /data/year=2018/month=11/day=2), then you need a crawler to register newly added partitions (i.e. /day=3) in the Data Catalog in order to query them via Athena.
However, if data is not partitioned or arrives into already registered partitions, then there is no need to run a crawler.
As an alternative to running a crawler, you can discover and register new partitions by running the Athena command MSCK REPAIR TABLE <table>, or by registering them manually.
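Registering a new partition manually might look like this (a hedged sketch; the table and bucket names are illustrative):
-- Illustrative: register the newly added /day=3 folder as a partition.
ALTER TABLE my_table ADD IF NOT EXISTS
  PARTITION (year = '2018', month = '11', day = '3')
  LOCATION 's3://my-bucket/data/year=2018/month=11/day=3/';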
The easiest way to create a table in the Data Catalog is to run a crawler. But if you know the schema and have the patience to compose a CREATE TABLE Athena query, or to fill in all the fields via the AWS Glue console, then you can go that way as well.
If you have the schema then you don't need to use the crawler and you might get better results (the crawler assumes partition columns are strings for example).
As Yuriy says, remember to run MSCK REPAIR TABLE or register new partitions manually.
MSCK can time out if you've added a lot of partitions. If it does, keep running it until it completes normally.

Can AWS Athena update or insert data stored in S3?

The documentation just says that it is a query service, but it does not explicitly state whether it can perform data updates.
If Athena cannot do inserts or updates, is there any other AWS service that can, like a normal DB?
Amazon Athena is, indeed, a query service -- it only allows data to be read from Amazon S3.
One exception, however, is that the results of the query are automatically written to S3. You could, therefore, use a query to generate results that could be used by something else. It's not quite updating data but it is generating data.
My previous attempts to use Athena output in another Athena query didn't work due to problems with the automatically-generated header, but there might be some workarounds available.
If you are seeking a service that can update information in S3, you could use Amazon EMR, which is basically a managed Hadoop cluster. Very powerful and capable, and can most certainly update information in S3, but it is rather complex to learn.
Amazon Athena adds support for inserting data into a table using the results of a SELECT query or using a provided set of values
Amazon Athena now supports inserting new data to an existing table using the INSERT INTO statement.
https://aws.amazon.com/about-aws/whats-new/2019/09/amazon-athena-adds-support-inserting-data-into-table-results-of-select-query/
https://docs.aws.amazon.com/athena/latest/ug/insert-into.html
Bucketed tables not supported
INSERT INTO is not supported on bucketed tables. For more information, see Bucketing vs Partitioning.
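A hedged sketch of both supported forms mentioned above, VALUES and SELECT (the table and column names are illustrative):
-- Illustrative: insert literal values ...
INSERT INTO my_table VALUES ('1', 'alice');
-- ... or insert the results of a SELECT.
INSERT INTO my_table SELECT id, name FROM my_staging_table;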
AWS S3 is object storage. Both Athena and S3 Select are for queries. The only way to modify an object (file) in S3 is to retrieve it from S3, modify it, and upload it back to S3.
As of September 20, 2019 Athena also supports INSERT INTO: https://aws.amazon.com/about-aws/whats-new/2019/09/amazon-athena-adds-support-inserting-data-into-table-results-of-select-query/
Finally, there is a solution from AWS. You can now perform CRUD (create, read, update and delete) operations with AWS Athena: the Athena Iceberg integration is now generally available. Create the table with:
TBLPROPERTIES ( 'table_type' ='ICEBERG' [, property_name=property_value])
then you can use its CRUD features.
For a quick introduction, search for "Insert / Update / Delete on S3 With Amazon Athena and Apache Iceberg | Amazon Web Services" on YouTube.
Also read the Considerations and Limitations section of the documentation.
Athena supports CTAS (CREATE TABLE AS SELECT) statements as of October 2018. You can specify the output location and file format, among other options.
https://docs.aws.amazon.com/athena/latest/ug/ctas.html
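A hedged CTAS sketch (the table names and location are illustrative):
-- Illustrative: materialize a query result as a new Parquet table in S3.
CREATE TABLE my_new_table
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/my_new_table/'
) AS
SELECT * FROM my_source_table
WHERE year = '2018';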
To INSERT into tables you can write additional files in the same format to the S3 path for a given table (this is somewhat of a hack), or preferably add partitions for the new data.
Like many big data systems, Athena is not capable of handling UPDATE statements.
We could use something known as Apache Iceberg in collaboration with Athena to perform CRUD operations on S3 data inside AWS itself.
The only caveat is that at table creation time we need to add the extra table property table_type = 'ICEBERG'.
E.g.:
create table demo (
  id string,
  attr1 string
)
location 's3://path'
TBLPROPERTIES (
  'table_type' = 'ICEBERG'
);
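Once the Iceberg table exists, row-level changes become possible. A hedged sketch using the table above (the values are illustrative):
-- Illustrative row-level operations on the Iceberg table defined above.
UPDATE demo SET attr1 = 'new_value' WHERE id = '42';
DELETE FROM demo WHERE id = '43';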
For more details: https://www.youtube.com/watch?v=u1v666EXCJw