I want to create an empty Athena table over an S3 bucket which will hold rows from other Athena tables. After each day, the data in this table gets old for my use case and hence I have to drop it and create a new table and insert latest rows into it.This table will be filled with rows from other tables and has nothing to do with data in the S3 bucket it is being created in. I want to use this table to just insert rows into it from other Athena tables, so that I can use quicksight on the data.
Right now, when I Drop and Create the Athena table over the same S3 bucket, it shows all the previous rows that were inserted into it. So in essence, how do I delete these previously inserted rows and get a empty table each time?
Right now I do something like this:
DROP TABLE IF EXISTS athena-table
CREATE EXTERNAL TABLE IF NOT EXISTS athena-table LOCATION 'SAME_S3_BUCKET'
Related
I am trying to create a table using SQL Template, insert rows and fetch the data from newly created table in Athena. I gave a s3 location which already have a csv file with some data. I'm getting all the data(csv files data and table data which I inserted using insert query). I want to fetch only table data not others files data. Why this behavior is happening? I attached the screenshot of my query and s3 location.
On the LOCATION 's3:://some-location'; line, dont you want to add in there the exact file you want to build a table from?
i.e. `LOCATION 's3:://some-location/the-file.csv';
I am having a parquet file with the below structure
column_name_old
First
Second
I am crawling this file to a table in AWS glue, however in the schema table I want the table structure as below without changing anything in parquet files
column_name_new
First
Second
I tried updating table structure using boto3
col_list = js['Table']['StorageDescriptor']['Columns']
for x in col_list:
if isinstance(x, dict):
x.update({'Name': x['Name'].replace('column_name_old', 'column_name_new')})
And it works as I can see the table structure updated in Glue catalog, but when I query the table using the new column name I don't get any data as it seems the mapping between the table structure and partition files is lost.
Is this approach even possible or I must change the parquet files itself? If it's possible what I am doing wrong?
You can create a view of the column name mapped to other value.
I believe a change in the column name will break the meta catalogue.
I want a table to store the history of a object for a week and then replace the same with history of next week. What would be the best way to achieve this in aws?
The data is stored in json format in s3 is a weekly dump. The pipeline runs the script weekly once and dumps data into s3 for analysis. For the next run of the script i do not need the previous week-1 data, so this needs to be replaced with new week-2 data. The schema of the table remains constant but the data keeps changing every week.
I would recommend to use data partitioning to solve your issue without deleting underlying S3 files from previous weeks (which is not possible via an Athena query).
Thus, the idea is to use a partition key based on the date, and then use this partition key in the WHERE clause of your Athena request, which will cause Athena to ignore previous files (which are not under the last partition).
For example, if you use the file dump date as partition key (let's say we chose to name it dump_key), your files will have to be stored in subfolders like
s3://your-bucket/subfolder/dump_key=2021-01-01-13-00/files.csv
s3://your-bucket/subfolder/dump_key=2021-01-07-13-00/files.csv
Then, during your data processing, you'll first need to create your table and specify a partition key with the PARTITIONED BY option.
Then, you'll have to make sure you added a new partition using the PARTITION ADD command every time it's necessary for your use case:
ALTER TABLE your_table ADD PARTITION (dump_key='2021-01-07-13-00') location 's3://your-bucket/subfolder/dump_key=2021-01-07-13-00/'
Then you'll be able to query your table by filtering previous data using the right WHERE clause:
SELECT * FROM my_table WHERE dump_key >= 2021-01-05-00-00
This will cause Athena to ignore files in previous partitions when querying your table.
Documentation here:
https://docs.aws.amazon.com/athena/latest/ug/partitions.html
Is it possible to delete data stored in S3 through an Athena query? I have some rows I have to delete from a couple of tables (they point to separate buckets in S3).
I couldn't find a way to do it in the Athena User Guide: https://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf and DELETE FROM isn't supported, but I'm wondering if there is an easier way than trying to find the files in S3 and deleting them.
You can leverage Athena to find out all the files that you want to delete and then delete them separately. There is a special variable "$path".
Select "$path" from <table> where <condition to get row of files to delete>
To automate this, you can have iterator on Athena results and then get filename and delete them from S3.
I also would like to add that after you find the files to be updated you can filter the rows you want to delete, and create new files using CTAS:
https://docs.aws.amazon.com/athena/latest/ug/ctas.html
Later you can replace the old files with the new ones created by CTAS. I think it is the most simple way to go
The answer is Yes, now you can delete the data from Athena, recently AWS has introduced ICEBERG table which supports the ACID property.
You need to create an iceberg table that will have the same data as your Athena table(where you want to delete/update records) using the below steps.
Create ICEBERG TABLE
Create table new_iceberg_table
(id double, name string)
LOCATION 'S3://path/where/you/want/to_save/'
TBLPROPERTIES (table_type='iceberg')
Load data from your Data Catalogue into this new iceberg table.
Insert into datasource.new_iceberg_table
Select * from datasource.main_athena_table.
main_athena_table = Table where you want to perform Delete/Update or ACID.
new_iceberg_table = Newly created table
Now you can insert, update, and delete the data from iceberg table.
You can also time travel using SYSTEM_TIME.
Relevant SQL's
#Update SQL =
UPDATE from datasource.new_iceberg_table set id = 04 where name='ABC' ;
#Delete SQL =
DELETE from datasource.new_iceberg_table where name='ABC' ;
#Time travel SQL (In case you want to time travel and want to see the older data)
SELECT * from datasource.new_iceberg_table for SYSTEM_TIME as of (current_timestamp - interval '10' minutes) where name='ABC'
Thank you.
I would just like to add to Dhaval's answer.
You can find out the path of the file with the rows that you want to delete and instead of deleting the entire file, you can just delete the rows from the S3 file which I am assuming would be in the Json format.
The process is to download the particular file which has those rows, remove the rows from that file and upload the same file to S3.
This just replaces the original file with the one with modified data (in your case, without the rows that got deleted). After the upload, Athena would tranform the data again and the deleted rows won't show up.
Use AWS Glue for that.
Load your data, delete what you need to delete, save the data back.
Now you can also delete files from s3 and merge data: https://aws.amazon.com/about-aws/whats-new/2020/01/aws-glue-adds-new-transforms-apache-spark-applications-datasets-amazon-s3/
AWS has announced general availability of Iceberg integration with Athena and Athena now support DMLs at raw level for Iceberg tables.
UPDATE and DELETE rows can be done using SQLs:
DELETE FROM [db_name.]table_name [WHERE predicate]
UPDATE [db_name.]table_name SET xx=yy[,...] [WHERE predicate]
For more details - AWS DOCUMENT
Note that Athena Iceberg integration now is very restrictive - nested SQLs for Deletes and Updates are NOT supported.
Below query won't work:
Delete from table1 where uniqueid in (select b.uniqueid from delete_staging b)
Shivendra Singh's answer about ICEBERG should be accepted, as ICEBERG seems to answer all needs now. But if you need to stay on the Hive table, or if your files format is JSON and you need to keep it this way, you have the following option:
Use CTAS to create new table with values you want to keep. If it's hard to phrase query this way, you can always do something like where id not in (select id ...) or select * from ... except select * from .... If your table is partitioned, and after deletion there should be more than 100 partitions, you'll need to use "insert into" technique to create up to 100 partitions per query (https://docs.aws.amazon.com/athena/latest/ug/ctas-insert-into.html).
Move (just in case) original data from S3 for partitions that were relevant for the deletion
Move data that was created by (1)
Our s3 buckets generally have a number of sub-directories, so that the path to a bucket is something like s3:top-level-function-group/more-specific-folder/org-tenant-company-id/entityid/actual-data
We're looking into Athena to be able to query against data on that /actual-data level, but within the org-tenant-company-id, so that would have to be passed as some kind of parameter.
Or would that org-tenant-company-id be a partition?
is it possible to create an athena table that queries against this structure? And what would the s3 location be on the create table wizard? I tried it with s3:top-level-function-group/more-specific-folder/ but when it ran, I think it said something like '0 Kb data read'.
You can create a partitioned table as follows, where the partition keys are defined only in the PARTITIONED BY clause, not in the list of table fields:
CREATE EXTERNAL TABLE mydb.mytable (
id int,
stuff string,
...
)
PARTITIONED BY (
orgtenantcompanyid string
)
LOCATION 's3://mybucket/top-level-function-group/more-specific-folder/';
After creating the table, you can then load individual partitions:
ALTER TABLE mydb.mytable ADD PARTITION (orgtenantcompanyid='org1')
LOCATION 's3://mybucket/top-level-function-group/more-specific-folder/org1';
Result rows will contain the partition fields like orgtenantcompanyid.
Yes, it is possible to create tables that only use contents of a specific subdirectory.
It's normal that after creating your table you see 0kb read. That's because no data is read when you CREATE a table.
To check whether you can acutally query the data do something like:
SELECT * FROM <table_name> LIMIT 10
Partitioning only makes sense if the data structure is identical in all the different directories so that the table definition applies to all the data under the location.
And yes, it's possible to use the path structure to create partitions. However, not automatically if it's not in the right format /key=value/. You can use the path as an attribute, though, as explained here: How to get input file name as column in AWS Athena external tables