Is it possible to change a database (schema) name in AWS Athena? - amazon-athena

I created a database and some tables with Data on AWS Athena and would like to rename the database without deleting and re-creating the tables and database. Is there a way to do this? I tried the standard SQL alter database but it doesn't seem to work.
thanks!

I'm afraid there is no way to do this, according to this official forum thread. You would need to remove the database and re-create it. However, since Athena does not store any data by itself, deleting a table or a database won't affect your data stored on S3. Therefore, if you kept all the scripts that create the external tables, re-creating the database should be a fairly quick thing to do.
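A minimal sketch of that approach, assuming hypothetical database, table, and S3 names (the original CREATE EXTERNAL TABLE scripts would simply be re-run against the new database):

-- create the database under the new name
CREATE DATABASE new_db_name;
-- re-point the external table at the data already sitting in S3; nothing is copied
CREATE EXTERNAL TABLE new_db_name.my_table (
  id string,
  created_at string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/my-table-prefix/';
-- once everything checks out, drop the old database (metadata only, the S3 data stays)
DROP DATABASE old_db_name CASCADE;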

Athena doesn't support renaming a database. You need to re-create the database with a new name.
You can use Presto, the open-source engine that Athena is built on; Presto supports more DDL statements.

Related

How to add columns to an existing Athena table using Avro storage

I have an existing Athena table (with Hive-style partitions) that's using the Avro SerDe. When I first created the table, I declared both the Athena schema and the avro.schema.literal schema, per the AWS instructions. Everything has been working great.
I now wish to add new columns that will apply going forward but not be present on the old partitions. I tried a basic ADD COLUMNS command that claims to succeed but has no impact on SHOW CREATE TABLE. I then wondered if I needed to change the Avro schema declaration as well, which I attempted to do but discovered that ALTER TABLE SET SERDEPROPERTIES DDL is not supported in Athena.
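For reference, a sketch of the kind of statement attempted (the table and column names here are made up):

ALTER TABLE my_avro_table ADD COLUMNS (new_col string);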
AWS claims I should be able to add columns when using Avro, but at this point I'm unsure how to do it. Even if I'm willing to drop the table metadata and redeclare all of the partitions, I'm not sure how to do it right since the schema is different on the historical partitions.
Looking for high-level guidance on the steps to be taken. Documentation is scant, and Athena seems to lack support for commands that are referenced for this same scenario in the vanilla Hive world. Thanks for any insights.

Retention and archival policy on Hive data

We have an AWS EMR cluster that includes Hive, with its metastore backed by Aurora and its data stored in S3. There are programs that create databases and tables in Hive and populate them with data.
After a while, these databases are no longer needed (say, after 1 year). We want to delete those Hive databases automatically after a set period. The usual way is to set up a cron job that runs every month or so, finds the databases in an internal metadata table that are older than 1 year, and programmatically fires the Hive queries that delete them. But this has some drawbacks; for example, manually created tables are not covered.
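A sketch of the kind of statement such a cleanup job ends up firing (the database name here is hypothetical):

DROP DATABASE IF EXISTS reporting_2021 CASCADE;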
Is there any hive built-in feature that does the above?
Hive is actually just a metadata store that defines how data should be interpreted. It does not manage any of the underlying data. (This is a major difference between Hive and a conventional database, and it is why Hive can use multiple file backends (HDFS and S3) in the same Hive instance.)
I'm going to guess you are using an S3 bucket for your data, so you likely want to look into expiring objects with an S3 lifecycle rule. This will do exactly what you want: delete data after a period of time. It will not disrupt Hive.
If you are using partitions you may wish to do some additional cleanup.
MSCK REPAIR TABLE will help maintain the partitions in Hive, but it is really slow on S3 and can periodically time out. YMMV.
It's better to drop partitions:
ALTER TABLE bills DROP IF EXISTS PARTITION (mydate='2022-02') PURGE;
In Hive you can implement partition retention (since Hive 3.1.0).
For example, to drop partitions and their data after 7 days:
ALTER TABLE employees SET TBLPROPERTIES ('partition.retention.period'='7d');
There is no built-in Hive tool that removes databases according to a retention period.
You have been doing this for a while so you are likely well aware of the risks of deleting metadata older than a year.
There are several ways to define retention on data, but none that I'm aware of to remove metadata.
Things you could look at:
You could add a trigger in Aurora to delete tables directly from the Hive metastore (Hive tables store a create time and a last-access time) and build some logic that works at that level.
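A sketch of the kind of query such logic might run against the Aurora-hosted metastore (table and column names follow the standard Hive metastore schema; the one-year cutoff is only an example):

-- list tables created more than a year ago, together with their database
SELECT d.NAME AS db_name, t.TBL_NAME, t.CREATE_TIME
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
WHERE t.CREATE_TIME < UNIX_TIMESTAMP(NOW() - INTERVAL 1 YEAR);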

What is an effective way to copy Athena databases?

I have an Athena db that I would like to duplicate (with a different name of course). Right now, I don't have an effective way to do this. My approach right now is just to execute CTAS queries per table but this obviously doesn't work for databases with 100s of tables. I've also looked into using crawlers but unfortunately, I do not always have a reliable S3 export for this.
Is there a way for me to duplicate Athena databases? Perhaps a shell script?
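For reference, the per-table CTAS approach mentioned above looks roughly like this (database, table, and bucket names are made up), which is exactly what does not scale to hundreds of tables:

CREATE TABLE new_db.my_table
WITH (format = 'PARQUET', external_location = 's3://my-bucket/new_db/my_table/')
AS SELECT * FROM old_db.my_table;

Note that CTAS also writes new data files to the target location, so it duplicates the storage as well as the metadata.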

AWS Glue crawler needs to create one table from many files with identical schemas

We have a very large number of folders and files in S3, all under one particular folder, and we want to crawl for all the CSV files, and then query them from one table in Athena. The CSV files all have the same schema. The problem is that the crawler is generating a table for every file, instead of one table. Crawler configurations have a checkbox option to "Create a single schema for each S3 path" but this doesn't seem to do anything.
Is what I need possible? Thanks.
Glue crawlers claim to solve many problems, but in fact solve few. If you're even slightly outside the scope of what they were designed for, you're out of luck. There might be a way to configure one to do what you want, but in my experience trying to make Glue crawlers do things they aren't perfectly aligned with is not worth the effort.
It sounds like you have a good idea of what the schema of your data is. When that is the case, Glue crawlers also provide very little value. You probably have a better idea of what the schema should look like than Glue will ever be able to figure out.
I suggest that you manually create the table, and write a one-off script that lists all the partition locations on S3 that you want to include in the table and generates ALTER TABLE ADD PARTITION … SQL, or Glue API calls, to add those partitions to the table.
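The statements such a script would emit look roughly like this (the table name, partition key, and bucket are made up):

ALTER TABLE my_table ADD IF NOT EXISTS
  PARTITION (dt='2020-02-03') LOCATION 's3://my-bucket/data/dt=2020-02-03/'
  PARTITION (dt='2020-02-04') LOCATION 's3://my-bucket/data/dt=2020-02-04/';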
To keep the table up to date when new partition locations are added, have a look at this answer for guidance: https://stackoverflow.com/a/56439429/1109
One way to do what you want is to use just one of the tables created by the crawler as an example, and create a similar table manually (in AWS Glue -> Tables -> Add tables, or in Athena itself) with a statement like
CREATE EXTERNAL TABLE `tablename`(
  `column1` string,
  `column2` string, ...)
PARTITIONED BY (`dt` string)               -- only if your data is partitioned
LOCATION 's3://your-bucket/your-prefix/';  -- placeholder, use your real S3 path
Using an existing table as an example, you can see the query used to create it in Athena: go to Database, select your database from the Glue Data Catalog, click the three dots in front of the crawler-created table you chose as an example, and pick the "Generate Create Table DDL" option. It will generate a big query for you; modify it as necessary (I believe you mostly need to look at the LOCATION and TBLPROPERTIES parts).
When you run this modified query in Athena, a new table will appear in the Glue Data Catalog. But it will not have any information about your S3 files and partitions, and the crawler most likely will not update the metastore info for you. So in Athena you can run the "MSCK REPAIR TABLE tablename;" query (it's not very efficient, but it works for me), and it will add the missing partition information. In the Results tab you will see something like this (if you use partitions on S3, of course):
Partitions not in metastore: tablename:dt=2020-02-03 tablename:dt=2020-02-04
Repair: Added partition to metastore tablename:dt=2020-02-03
Repair: Added partition to metastore tablename:dt=2020-02-04
After that you should be able to run your Athena queries.

AWS Glue: Do I really need a Crawler for new content?

What I understand from the AWS Glue docs is that a crawler will help crawl and discover new data. However, I noticed that after crawling once, if new data goes into S3, the data is actually already discovered when I query the Data Catalog from Athena, for example. So, can I say I do not need the crawler to crawl every time new data is added, unless there are new schemas?
In fact, if I know the schema of the files, I can just manually create the table and do without a crawler, am I correct?
If data is partitioned by some keys (placed in sub-folders, like /data/year=2018/month=11/day=2), then you need a crawler to register newly added partitions (e.g. /day=3) in the Data Catalog to be able to query them via Athena.
However, if data is not partitioned, or arrives into already registered partitions, then there is no need to run a crawler.
As an alternative to running a crawler, you can discover and register new partitions by running the Athena command MSCK REPAIR TABLE <table>, or by registering them manually.
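Registering the partition from the example above manually would look roughly like this (the table name and bucket are made up):

ALTER TABLE my_table ADD IF NOT EXISTS
  PARTITION (year='2018', month='11', day='3')
  LOCATION 's3://my-bucket/data/year=2018/month=11/day=3/';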
The easiest way to create a table in the Data Catalog is to run a crawler. But if you know the schema and have the patience to compose a CREATE TABLE Athena query or fill in all the fields via the AWS Glue console, then you can go that way as well.
If you have the schema then you don't need to use the crawler and you might get better results (the crawler assumes partition columns are strings for example).
As Yuriy says, remember to run MSCK REPAIR TABLE or register new partitions manually.
MSCK can time out if you've added a lot of partitions. If it does, keep running it until it completes normally.