Update AWS Athena data & table to rename columns

Today I found myself with a simple problem: renaming a column of an Athena (Glue) table from its old name to a new one.
First, I searched here and tried some solutions like this one, this one, and many others. Unfortunately, none of them worked, so I decided to use my knowledge and imagination.
I'm posting this with the intention of sharing, but also to learn how others do it and maybe find out that I reinvented the wheel. So please share your way if you know how to do it.
My setup is an Athena JSON table, partitioned by day, with a valuable and enormous amount of data; the infrastructure is defined and updated through CloudFormation.
How to rename an Athena column and still keep the data?

Explaining it without all the CloudFormation infrastructure.
Imagine a table containing:
userId
score
otherColumns
eventDateUtc
dt_utc
The table is partitioned by dt_utc and stored in JSON format. We need to change the column score to deltaScore.
Keep in mind that, although I haven't tested this with other formats/configurations, it should apply to any configuration supported by Athena, since we are going to use the Athena engine to do the job for us.
How to do it
Note: if you run the CloudFormation migration first, you will "lose" access to the dropped column, but you can simply rename the column back and the data reappears.
These are the steps required to rename a column of an AWS Athena table:
Create a temporary table mapping the old column name to the new one.
This can be done with CREATE TABLE AS; read more in the AWS docs.
With this command, we use the Athena engine to apply the transformation to the files of the original table and save the result at s3://bucket_name/A_folder/temp_table_rename/.
CREATE TABLE "temp_table_rename"
WITH(
format = 'JSON',
external_location = 's3://bucket_name/A_folder/temp_table_rename/',
partitioned_by = ARRAY['dt_utc']
)
AS
SELECT DISTINCT
userid,
score as deltascore,
otherColumns,
eventDateUtc,
"dt_utc"
FROM "my_database"."original_table"
Apply the rename by running the CloudFormation stack with the changes, or in whatever way you manage your infrastructure.
At this point, you could even drop original_table and create it again with the right column name.
After the rename, you will notice that the renamed column has no data.
Remove the data of the original table by deleting its S3 source.
Copy the data from the temp table location to the original table location.
I prefer to use an AWS CLI command, as there can be thousands of files to copy:
aws s3 cp s3://bucket_name/A_folder/temp_table_rename/ s3://bucket_name/A_folder/original_table/ --recursive
Restore the partition metadata of the original table:
MSCK REPAIR TABLE "my_database"."original_table"
done.
Final notes:
Using CREATE TABLE AS to do the transformation job allows you to do much more than just rename a column: for example, you can split the data of one column into two new columns, or merge several columns into a single one.
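For example, here is a minimal sketch of a CTAS that renames score and also derives new columns, assuming eventDateUtc is an ISO-8601 string; the derived column names event_date and user_score_key, and the target location, are hypothetical:
CREATE TABLE "temp_table_transform"
WITH (
    format = 'JSON',
    external_location = 's3://bucket_name/A_folder/temp_table_transform/',
    partitioned_by = ARRAY['dt_utc']
)
AS
SELECT
    userid,
    score AS deltascore,                                                            -- simple rename
    date_format(from_iso8601_timestamp(eventDateUtc), '%Y-%m-%d') AS event_date,    -- split out just the date part
    concat(cast(userid AS varchar), '#', cast(score AS varchar)) AS user_score_key, -- merge two columns into one
    eventDateUtc,
    "dt_utc"
FROM "my_database"."original_table"
Note that the partition column (dt_utc) must stay last in the SELECT list for a partitioned CTAS.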

Related

How to add columns to an existing Athena table using Avro storage

I have an existing Athena table (w/ hive-style partitions) that's using the Avro SerDe. When I first created the table, I declared the Athena schema as well as the Athena avro.schema.literal schema per AWS instructions. Everything has been working great.
I now wish to add new columns that will apply going forward but not be present on the old partitions. I tried a basic ADD COLUMNS command that claims to succeed but has no impact on SHOW CREATE TABLE. I then wondered if I needed to change the Avro schema declaration as well, which I attempted to do but discovered that ALTER TABLE SET SERDEPROPERTIES DDL is not supported in Athena.
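For reference, the basic ADD COLUMNS DDL in question looks roughly like this (table and column names are placeholders):
ALTER TABLE my_database.my_avro_table ADD COLUMNS (new_col string)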
AWS claims I should be able to add columns when using Avro, but at this point I'm unsure how to do it. Even if I'm willing to drop the table metadata and redeclare all of the partitions, I'm not sure how to do it right since the schema is different on the historical partitions.
Looking for high-level guidance on the steps to be taken. Documentation is scant and Athena seems to be lacking support for commands that are referenced in this same scenario in vanilla Hive world. Thanks for any insights.

How does Amazon Athena manage rename of columns?

Hi everyone!
I'm working on a solution that intends to use Amazon Athena to run SQL queries from Parquet files on S3.
Those files will be generated from a PostgreSQL database (RDS). I'll run a query and export the data to S3 using Python's PyArrow.
My question is: since Athena is schema-on-read, adding or deleting columns in the database will not be a problem... but what will happen when a column gets renamed in the database?
Day 1: COLUMNS['col_a', 'col_b', 'col_c']
Day 2: COLUMNS['col_a', 'col_beta', 'col_c']
On Athena,
SELECT col_beta FROM table;
will return only data from Day 2, right?
Is there a way for Athena to know about this schema evolution, or would I have to run a script to iterate through all my files on S3, rename the columns, and update the table schema on Athena from 'col_b' to 'col_beta'?
Would the AWS Glue Data Catalog help in any way to solve this?
I'd love to discuss this more!
I recommend reading more about handling schema updates with Athena here. Generally Athena supports multiple ways of reading Parquet files (as well as other columnar data formats such as ORC). By default, using Parquet, columns will be read by name, but you can change that to reading by index as well. Each way has its own advantages / disadvantages dealing with schema changes. Based on your example, you might want to consider reading by index if you are sure new columns are only appended to the end.
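As a rough illustration, switching a Parquet table to index-based (positional) column access is done with a SerDe property, so a rename like col_b -> col_beta is transparent as long as the column position stays the same (table name, columns, and location below are placeholders):
CREATE EXTERNAL TABLE my_parquet_table (
    col_a string,
    col_beta string,
    col_c string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('parquet.column.index.access' = 'true')
STORED AS PARQUET
LOCATION 's3://your-bucket/your-prefix/'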
A Glue crawler can help you to keep your schema updated (and versioned), but it doesn't necessarily help you to resolve schema changes (logically). And it comes at an additional cost, of course.
Another approach could be to use a schema that is a superset of all schemas over time (using columns by name) and define a view on top of it to resolve changes "manually".
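A hedged sketch of that last idea, assuming the table schema is a superset that contains both col_b (populated in the old files) and col_beta (populated in the new files); the view and table names are placeholders:
CREATE OR REPLACE VIEW unified_view AS
SELECT
    col_a,
    coalesce(col_beta, col_b) AS col_beta,  -- pick whichever column the underlying file actually contains
    col_c
FROM my_superset_table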
You can configure the AWS Glue crawler to run on demand or on a time-based schedule, so every time your data on S3 is updated a new schema version is generated (you can edit the data types of the attributes in the schema). This way your columns stay up to date and you can query the new field.
AWS Athena reads CSV and TSV data in the order of the columns in the schema and returns them in that same order. It does not use column names to map data to a column, which is why you can rename columns in CSV or TSV without breaking Athena queries.

Is it possible delete entire table stored in S3 buckets from athena query?

I want a table to store the history of an object for a week and then replace it with the history of the next week. What would be the best way to achieve this in AWS?
The data is stored in S3 in JSON format as a weekly dump. The pipeline runs the script once a week and dumps data into S3 for analysis. For the next run of the script I do not need the previous week-1 data, so it needs to be replaced with the new week-2 data. The schema of the table remains constant but the data changes every week.
I would recommend using data partitioning to solve your issue without deleting the underlying S3 files from previous weeks (which is not possible via an Athena query).
The idea is to use a partition key based on the date, and then use this partition key in the WHERE clause of your Athena queries, which will cause Athena to ignore previous files (those that are not under the latest partition).
For example, if you use the file dump date as partition key (let's say we chose to name it dump_key), your files will have to be stored in subfolders like
s3://your-bucket/subfolder/dump_key=2021-01-01-13-00/files.csv
s3://your-bucket/subfolder/dump_key=2021-01-07-13-00/files.csv
Then, during your data processing, you'll first need to create your table and specify a partition key with the PARTITIONED BY option.
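For example, a minimal table definition along those lines could look like this (column names and the SerDe are placeholders; adjust them to your actual file format):
CREATE EXTERNAL TABLE my_weekly_table (
    object_id string,
    payload string
)
PARTITIONED BY (dump_key string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://your-bucket/subfolder/'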
Then, you'll have to make sure you add a new partition with an ALTER TABLE ... ADD PARTITION command every time it's necessary for your use case:
ALTER TABLE your_table ADD PARTITION (dump_key='2021-01-07-13-00') location 's3://your-bucket/subfolder/dump_key=2021-01-07-13-00/'
Then you'll be able to query your table by filtering previous data using the right WHERE clause:
SELECT * FROM my_table WHERE dump_key >= '2021-01-05-00-00'
This will cause Athena to ignore files in previous partitions when querying your table.
Documentation here:
https://docs.aws.amazon.com/athena/latest/ug/partitions.html

AWS Glue crawler need to create one table from many files with identical schemas

We have a very large number of folders and files in S3, all under one particular folder, and we want to crawl for all the CSV files, and then query them from one table in Athena. The CSV files all have the same schema. The problem is that the crawler is generating a table for every file, instead of one table. Crawler configurations have a checkbox option to "Create a single schema for each S3 path" but this doesn't seem to do anything.
Is what I need possible? Thanks.
Glue crawlers claim to solve many problems but in fact solve few. If you're slightly outside the scope of what they were designed for, you're out of luck. There might be a way to configure one to do what you want, but in my experience trying to make Glue crawlers do things they weren't built for is not worth the effort.
It sounds like you have a good idea of what the schema of your data is. When that is the case, Glue crawlers also provide very little value. You probably have a better idea of what the schema should look like than Glue will ever be able to figure out.
I suggest that you manually create the table and write a one-off script that lists all the partition locations on S3 that you want to include in the table and generates ALTER TABLE ADD PARTITION … SQL, or Glue API calls, to add those partitions to the table.
To keep the table up to date when new partition locations are added, have a look at this answer for guidance: https://stackoverflow.com/a/56439429/1109
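For illustration, the kind of statement such a one-off script could generate might look like this (bucket, table, and partition values are placeholders); Athena accepts several partitions in a single ALTER TABLE:
ALTER TABLE my_database.my_table ADD IF NOT EXISTS
    PARTITION (dt = '2020-02-03') LOCATION 's3://your-bucket/prefix/dt=2020-02-03/'
    PARTITION (dt = '2020-02-04') LOCATION 's3://your-bucket/prefix/dt=2020-02-04/'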
One way to do what you want is to use just one of the tables created by the crawler as an example and create a similar table manually, either in AWS Glue -> Tables -> Add tables, or in Athena itself with something like:
CREATE EXTERNAL TABLE `tablename`(
  `column1` string,
  `column2` string, ...
To use an existing table as an example, you can see the query used to create it: in Athena, go to Database -> select your database from the Glue Data Catalog, click the three dots in front of the "automatically created by crawler" table you chose as an example, and click the "Generate Create table DDL" option. It will generate a big query for you; modify it as necessary (I believe you mostly need to look at the LOCATION and TBLPROPERTIES parts).
When you run this modified query in Athena, a new table will appear in the Glue Data Catalog, but it will not have any information about your S3 files and partitions, and the crawler most likely will not update the metastore info for you. So in Athena you can run "MSCK REPAIR TABLE tablename;" (it's not very efficient, but it works for me), and it will add the missing file information. In the Results tab you will see something like this (if you use partitions on S3, of course):
Partitions not in metastore: tablename:dt=2020-02-03 tablename:dt=2020-02-04
Repair: Added partition to metastore tablename:dt=2020-02-03
Repair: Added partition to metastore tablename:dt=2020-02-04
After that you should be able to run your Athena queries.

Can I delete data (rows in tables) from Athena?

Is it possible to delete data stored in S3 through an Athena query? I have some rows I have to delete from a couple of tables (they point to separate buckets in S3).
I couldn't find a way to do it in the Athena User Guide: https://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf and DELETE FROM isn't supported, but I'm wondering if there is an easier way than trying to find the files in S3 and deleting them.
You can leverage Athena to find out all the files that you want to delete and then delete them separately. There is a special variable "$path".
Select "$path" from <table> where <condition to get row of files to delete>
To automate this, you can have iterator on Athena results and then get filename and delete them from S3.
I would also like to add that after you find the files to be updated, you can filter out the rows you want to delete and create new files using CTAS:
https://docs.aws.amazon.com/athena/latest/ug/ctas.html
Later you can replace the old files with the new ones created by CTAS. I think it is the simplest way to go.
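A minimal sketch of that CTAS approach, assuming a hypothetical table and a condition that matches the rows to drop (names, columns, and the location are placeholders):
CREATE TABLE my_table_cleaned
WITH (
    format = 'JSON',
    external_location = 's3://bucket_name/cleaned/my_table/'
)
AS
SELECT *
FROM my_database.my_table
WHERE NOT (user_id = 'abc' AND event_date = '2021-01-01')  -- keep everything except the rows to delete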
The answer is yes: you can now delete data from Athena. AWS recently introduced the Iceberg table format, which supports ACID properties.
You need to create an Iceberg table containing the same data as your Athena table (the one where you want to delete/update records), using the steps below.
Create the Iceberg table:
CREATE TABLE new_iceberg_table (
    id double,
    name string
)
LOCATION 's3://path/where/you/want/to_save/'
TBLPROPERTIES ('table_type' = 'ICEBERG')
Load the data from your existing Data Catalog table into the new Iceberg table:
INSERT INTO datasource.new_iceberg_table
SELECT * FROM datasource.main_athena_table
main_athena_table = the table where you want to perform DELETE/UPDATE (i.e. ACID operations)
new_iceberg_table = the newly created table
Now you can insert, update, and delete data in the Iceberg table.
You can also time travel using SYSTEM_TIME.
Relevant SQL statements:
Update SQL:
UPDATE datasource.new_iceberg_table SET id = 04 WHERE name = 'ABC';
Delete SQL:
DELETE FROM datasource.new_iceberg_table WHERE name = 'ABC';
Time travel SQL (in case you want to time travel and see the older data):
SELECT * FROM datasource.new_iceberg_table FOR SYSTEM_TIME AS OF (current_timestamp - interval '10' minutes) WHERE name = 'ABC'
Thank you.
I would just like to add to Dhaval's answer.
You can find the path of the file containing the rows you want to delete, and instead of deleting the entire file, you can delete just those rows from the S3 file, which I am assuming is in JSON format.
The process is to download the particular file that has those rows, remove the rows from that file, and upload the same file back to S3.
This just replaces the original file with one containing the modified data (in your case, without the rows that were deleted). After the upload, Athena will read the data again and the deleted rows won't show up.
Use AWS Glue for that.
Load your data, delete what you need to delete, save the data back.
Now you can also delete files from s3 and merge data: https://aws.amazon.com/about-aws/whats-new/2020/01/aws-glue-adds-new-transforms-apache-spark-applications-datasets-amazon-s3/
AWS has announced general availability of the Iceberg integration with Athena, and Athena now supports row-level DML for Iceberg tables.
Rows can be updated and deleted using SQL:
DELETE FROM [db_name.]table_name [WHERE predicate]
UPDATE [db_name.]table_name SET xx=yy[,...] [WHERE predicate]
For more details - AWS DOCUMENT
Note that the Athena Iceberg integration is currently quite restrictive: nested queries in DELETE and UPDATE statements are NOT supported.
The query below won't work:
DELETE FROM table1 WHERE uniqueid IN (SELECT b.uniqueid FROM delete_staging b)
Shivendra Singh's answer about Iceberg should be accepted, as Iceberg seems to cover all the needs now. But if you need to stay on a Hive table, or if your files are in JSON format and you need to keep it that way, you have the following option:
1. Use CTAS to create a new table with the values you want to keep. If it's hard to phrase the query that way, you can always do something like WHERE id NOT IN (SELECT id ...) or SELECT * FROM ... EXCEPT SELECT * FROM .... If your table is partitioned and there will be more than 100 partitions after the deletion, you'll need to use the "INSERT INTO" technique to create up to 100 partitions per query (https://docs.aws.amazon.com/athena/latest/ug/ctas-insert-into.html); see the sketch after this list.
2. Move (just in case) the original data out of S3 for the partitions that were relevant to the deletion.
3. Move the data that was created by step 1 into the original table's location.
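A rough sketch of the CTAS plus INSERT INTO technique mentioned in step 1, assuming a partition column named dt and a hypothetical rows_to_delete table holding the ids to remove (all names, locations, and date thresholds are placeholders):
-- First batch: a single CTAS can create at most 100 partitions
CREATE TABLE my_table_kept
WITH (
    format = 'JSON',
    external_location = 's3://bucket_name/kept/my_table/',
    partitioned_by = ARRAY['dt']
)
AS
SELECT *
FROM my_database.my_table
WHERE id NOT IN (SELECT id FROM my_database.rows_to_delete)
  AND dt <= '2021-03-31';
-- Next batches: each INSERT INTO can add up to 100 more partitions
INSERT INTO my_table_kept
SELECT *
FROM my_database.my_table
WHERE id NOT IN (SELECT id FROM my_database.rows_to_delete)
  AND dt > '2021-03-31' AND dt <= '2021-06-30';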