Is there any functionality to DELETE rows from a table, or even remove file from path list, or do I have to DROP the entire table and re-create it with the rows I need?
Also is there a TRUNCATE function?
Vora tables are based on files in HDFS. As of today (Vora1.2) you can only APPEND to an existing table. There is no DELETE or TRUNCATE functionality. You can however DROP the table and re-create it based on different files.
Related
I want a table to store the history of a object for a week and then replace the same with history of next week. What would be the best way to achieve this in aws?
The data is stored in json format in s3 is a weekly dump. The pipeline runs the script weekly once and dumps data into s3 for analysis. For the next run of the script i do not need the previous week-1 data, so this needs to be replaced with new week-2 data. The schema of the table remains constant but the data keeps changing every week.
I would recommend to use data partitioning to solve your issue without deleting underlying S3 files from previous weeks (which is not possible via an Athena query).
Thus, the idea is to use a partition key based on the date, and then use this partition key in the WHERE clause of your Athena request, which will cause Athena to ignore previous files (which are not under the last partition).
For example, if you use the file dump date as partition key (let's say we chose to name it dump_key), your files will have to be stored in subfolders like
s3://your-bucket/subfolder/dump_key=2021-01-01-13-00/files.csv
s3://your-bucket/subfolder/dump_key=2021-01-07-13-00/files.csv
Then, during your data processing, you'll first need to create your table and specify a partition key with the PARTITIONED BY option.
Then, you'll have to make sure you added a new partition using the PARTITION ADD command every time it's necessary for your use case:
ALTER TABLE your_table ADD PARTITION (dump_key='2021-01-07-13-00') location 's3://your-bucket/subfolder/dump_key=2021-01-07-13-00/'
Then you'll be able to query your table by filtering previous data using the right WHERE clause:
SELECT * FROM my_table WHERE dump_key >= 2021-01-05-00-00
This will cause Athena to ignore files in previous partitions when querying your table.
Documentation here:
https://docs.aws.amazon.com/athena/latest/ug/partitions.html
My partitions look like these
event_year=2019/event_week=37/event_date=2019-09-10
event_year=2019/event_week=42/event_date=2019-10-13
event_year=2019/event_week=8/event_date=2019-02-20
event_year=2020/event_week=24/event_date=2020-06-15
There are 1500 partitions like this how do I drop all the partitions at once?
The easiest and quickest way is to drop the table and recreate it. You can get the DDL with SHOW CREATE TABLE table_name if you don't have it.
If you really need to drop partitions and not the table the most efficient way is to use the Glue Data Catalog APIs to first list all partitions and then delete partitions in batches of 25.
It's not documented anywhere but sometimes in athena you can drop all partitions with
ALTER TABLE table_name DROP PARTITION (not_a_column=NULL)
This appears to be a side effect of being able to only specify one partition if you have a table partitioned on multiple dimensions.
If the above doesn't work then I fallback to using the awswrangler python library https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.catalog.delete_partitions.html
I am trying to drop all the partitions on an external table in a redshift cluster. I am unable to find an easy way to do it. I am currently doing this by running a dynamic query to select the dates from the table and concatenating it with the drop logic and taking the result set and running it separately like this
select 'ALTER TABLE procore_iad_ext.active_histories DROP PARTITION (values='''||rtrim(ltrim(values, '["'),'"]') ||''');' from svv_external_partitions
where tablename = 'xyz';
values looks like this ->["2009-03-10"]
Looking for a simpler direct solution. Thanks.
The easiest way to do this would be to drop the table itself. As long as you have the DDL to recreate the table and don't mind dropping all partitions, just DROP TABLE <schemaname>.<tablename>; then recreate the table. The new table will not have any partitions.
Please check out the Glue catalog. It provides a UI to easily delete the tables/partitions etc.
Is it possible to delete data stored in S3 through an Athena query? I have some rows I have to delete from a couple of tables (they point to separate buckets in S3).
I couldn't find a way to do it in the Athena User Guide: https://docs.aws.amazon.com/athena/latest/ug/athena-ug.pdf and DELETE FROM isn't supported, but I'm wondering if there is an easier way than trying to find the files in S3 and deleting them.
You can leverage Athena to find out all the files that you want to delete and then delete them separately. There is a special variable "$path".
Select "$path" from <table> where <condition to get row of files to delete>
To automate this, you can have iterator on Athena results and then get filename and delete them from S3.
I also would like to add that after you find the files to be updated you can filter the rows you want to delete, and create new files using CTAS:
https://docs.aws.amazon.com/athena/latest/ug/ctas.html
Later you can replace the old files with the new ones created by CTAS. I think it is the most simple way to go
The answer is Yes, now you can delete the data from Athena, recently AWS has introduced ICEBERG table which supports the ACID property.
You need to create an iceberg table that will have the same data as your Athena table(where you want to delete/update records) using the below steps.
Create ICEBERG TABLE
Create table new_iceberg_table
(id double, name string)
LOCATION 'S3://path/where/you/want/to_save/'
TBLPROPERTIES (table_type='iceberg')
Load data from your Data Catalogue into this new iceberg table.
Insert into datasource.new_iceberg_table
Select * from datasource.main_athena_table.
main_athena_table = Table where you want to perform Delete/Update or ACID.
new_iceberg_table = Newly created table
Now you can insert, update, and delete the data from iceberg table.
You can also time travel using SYSTEM_TIME.
Relevant SQL's
#Update SQL =
UPDATE from datasource.new_iceberg_table set id = 04 where name='ABC' ;
#Delete SQL =
DELETE from datasource.new_iceberg_table where name='ABC' ;
#Time travel SQL (In case you want to time travel and want to see the older data)
SELECT * from datasource.new_iceberg_table for SYSTEM_TIME as of (current_timestamp - interval '10' minutes) where name='ABC'
Thank you.
I would just like to add to Dhaval's answer.
You can find out the path of the file with the rows that you want to delete and instead of deleting the entire file, you can just delete the rows from the S3 file which I am assuming would be in the Json format.
The process is to download the particular file which has those rows, remove the rows from that file and upload the same file to S3.
This just replaces the original file with the one with modified data (in your case, without the rows that got deleted). After the upload, Athena would tranform the data again and the deleted rows won't show up.
Use AWS Glue for that.
Load your data, delete what you need to delete, save the data back.
Now you can also delete files from s3 and merge data: https://aws.amazon.com/about-aws/whats-new/2020/01/aws-glue-adds-new-transforms-apache-spark-applications-datasets-amazon-s3/
AWS has announced general availability of Iceberg integration with Athena and Athena now support DMLs at raw level for Iceberg tables.
UPDATE and DELETE rows can be done using SQLs:
DELETE FROM [db_name.]table_name [WHERE predicate]
UPDATE [db_name.]table_name SET xx=yy[,...] [WHERE predicate]
For more details - AWS DOCUMENT
Note that Athena Iceberg integration now is very restrictive - nested SQLs for Deletes and Updates are NOT supported.
Below query won't work:
Delete from table1 where uniqueid in (select b.uniqueid from delete_staging b)
Shivendra Singh's answer about ICEBERG should be accepted, as ICEBERG seems to answer all needs now. But if you need to stay on the Hive table, or if your files format is JSON and you need to keep it this way, you have the following option:
Use CTAS to create new table with values you want to keep. If it's hard to phrase query this way, you can always do something like where id not in (select id ...) or select * from ... except select * from .... If your table is partitioned, and after deletion there should be more than 100 partitions, you'll need to use "insert into" technique to create up to 100 partitions per query (https://docs.aws.amazon.com/athena/latest/ug/ctas-insert-into.html).
Move (just in case) original data from S3 for partitions that were relevant for the deletion
Move data that was created by (1)
I think Redshift as Relational Database. I am having one scenario which I wanted to know:-
If there is an existing redshift database, then how easy or how much time it would take to add a column? Does it take a lock on the table when added a new field?
Amazon Redshift is a columnar database. This means each column is stored separately on disk, which is why it operates so quickly.
Adding a column is therefore a relatively simple operation.
If you are worried about it, you could use the CREATE TABLE AS command to create a new table based on the current one, then add a column and see how long it takes.
Like any rdmbs column will be added in the order of entry. i.e it with the last columns (not in the middle or whereever you like)