I am trying to: 1) enable versioning on S3 buckets, and 2) delete previous versions after, say, 30 days. Do you know which lifecycle rule I should be setting to achieve 2)?
One of the rules is Permanently delete previous versions of objects. Under that rule, you need to set Number of days after objects become previous versions. The language in the public doc is not clear. Does that number mean that after that many days the current S3 object becomes a previous version and gets deleted? In that case I will lose the S3 objects, right?
Can someone confirm whether my understanding above is correct?
Which rule should I set so that the current version stays intact and only the previous versions get deleted after 30 days?
I looked at these examples, but all of them attempt to simply delete any S3 object that is older than 30 days. I am trying to delete only the previous versions of the objects.
Examples
1: Deleting old object versions on AWS S3
2: AWS: Delete Permanently S3 objects less than 30 days using 'Lifecycle Rule'
Thanks,
Pavan
You would use "Permanently delete previous versions of objects".
You would then enter "Number of days after objects become previous versions", which tells it to delete the object that many days after the version stopped being the current version.
A Version will only ever become a Previous (non-current) version if a new version of the object is uploaded with the same name.
"Does that number mean after that number of days, the current S3 object becomes previous and gets deleted?"
The current version will remain the current version until a user (e.g. you!) uploads a file with the same name. That upload becomes the current version, and the version that was "current" before becomes a "previous" version.
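In the API, that console option corresponds to the NoncurrentVersionExpiration lifecycle action. A minimal boto3 sketch, with the bucket name and rule ID as placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Keep current versions intact; permanently delete versions 30 days after
# they become noncurrent (previous). Bucket name and rule ID are placeholders.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "delete-previous-versions-after-30-days",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            }
        ]
    },
)
```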
Related
I have an S3 bucket with versioning enabled, and I want to permanently delete all the delete-marked object versions from the S3 bucket using a lifecycle rule.
Which of the below options do we need to choose in order to permanently delete the versions of the objects?
Note also that the delete-marked objects may be the current version.
Let's first understand what the current version and a noncurrent version are.
Whenever an object in a versioned bucket is deleted, the current object version becomes noncurrent, and the delete marker becomes the current version.
What is an expired delete marker?
A delete marker with zero noncurrent versions is referred to as an expired object delete marker.
So options 4 and 5 will solve your purpose:
Option 4 will permanently delete the noncurrent objects, which makes the delete marker expired, since there will be no noncurrent versions left.
Option 5 will delete the expired delete markers.
Note: Lifecycle rule policies take time to take effect, as objects are queued and processed asynchronously.
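If it helps, options 4 and 5 map to the NoncurrentVersionExpiration and ExpiredObjectDeleteMarker lifecycle actions, and they can be combined in one rule. A hedged boto3 sketch, with the bucket name, rule ID, and day count as placeholders:

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-versioned-bucket",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "purge-noncurrent-versions-and-expired-delete-markers",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # whole bucket
                # Option 4: permanently delete noncurrent versions after N days
                "NoncurrentVersionExpiration": {"NoncurrentDays": 1},
                # Option 5: remove delete markers once no noncurrent versions remain
                "Expiration": {"ExpiredObjectDeleteMarker": True},
            }
        ]
    },
)
```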
I am having trouble understanding the relationship between Delete markers and Lifecycle rules in S3 with versioning.
Here is my lifecycle rule:
Does this mean that the Expire current versions of objects option would delete a delete marker after it becomes the "current" version? AKA 180 days after the actual current version is deleted?
From what I understand this would mean:
After 180 days the current version of example.txt would expire: a delete marker is created for it, and the former current version becomes a noncurrent version attached to the delete marker.
Another 180 days later, the noncurrent example.txt would be permanently deleted and the "current" version (the delete marker) would "expire"; since it is a delete marker, that means it would be permanently deleted as well.
Is this an accurate understanding or do I need to make an additional lifecycle rule that deletes expired delete markers?
Thank you!
Yes, you are correct.
A delete marker on the current version makes it a noncurrent version.
After another 180 days, since you are deleting noncurrent versions, it will also delete the delete markers of those noncurrent versions.
An expired object delete marker is removed once all noncurrent versions of the deleted object have expired.
That is why you cannot expire current versions and delete expired object delete markers together in the same rule.
Refer to Removing expired object delete markers and Understanding delete markers for more detail.
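For reference, the rule discussed above would look roughly like this as a lifecycle rule dictionary for boto3's put_bucket_lifecycle_configuration (the ID is made up); because Expiration carries Days here, S3 will not accept ExpiredObjectDeleteMarker in the same rule:

```python
# Sketch of the rule discussed above; it goes inside the "Rules" list
# passed to put_bucket_lifecycle_configuration().
rule = {
    "ID": "expire-and-clean-up-after-180-days",  # made-up ID
    "Status": "Enabled",
    "Filter": {"Prefix": ""},  # whole bucket
    # Current versions get a delete marker 180 days after creation.
    "Expiration": {"Days": 180},
    # Versions are permanently deleted 180 days after becoming noncurrent;
    # per the answer above, delete markers left with no noncurrent versions
    # are then cleaned up as expired object delete markers.
    "NoncurrentVersionExpiration": {"NoncurrentDays": 180},
}
```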
I am writing files to an S3 bucket. How can I see the newly added files? E.g. in the pic below, you can see the files are not ordered by the Last modified field, and I can't find a way to sort on that field or any other field.
You cannot sort on that; it is just how the UI works.
The main reason is that, for buckets with 1000+ objects, the UI only "knows" about the 1000 elements displayed on the current page. Sorting them would be misleading: it would suggest you are seeing the newest or oldest 1000 objects of the bucket, when in fact it would just reorder the 1000 objects currently displayed. That would really confuse people, so it is better not to let the user sort at all than to sort incorrectly.
Showing the actual 1000 newest or oldest objects requires listing everything in the bucket, which takes time (minutes or hours for larger buckets), needs many backend requests, and incurs extra cost, since List requests are billed. If you want to retrieve the 1000 newest or oldest objects, you need to write code to do a full listing of the bucket or prefix, order all the objects, and then display part of the result, as in the sketch below.
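A minimal boto3 sketch of that approach (the bucket name and prefix are placeholders):

```python
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

# Full listing of the bucket/prefix; on large buckets this is the slow,
# billed part, since every page is a separate List request.
objects = []
for page in paginator.paginate(Bucket="my-bucket", Prefix="logs/"):
    objects.extend(page.get("Contents", []))

# Sort client-side by LastModified, newest first, and show the top 20.
objects.sort(key=lambda obj: obj["LastModified"], reverse=True)
for obj in objects[:20]:
    print(obj["LastModified"], obj["Key"])
```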
If you can sufficiently decrease the number of displayed objects with the "Find objects by prefix" field, the sort options become available and meaningful.
I have a DynamoDB table that has a created date/time column that indicates when the record/item was inserted into the table. I have about 20 years worth of data in this table (records were migrated from a previous database), and I would now like to truncate anything older than 6 months old moving forward.
The obvious thing to do here would be to set a TTL on the table for 6 months; however, my understanding is that AWS TTLs only go back a certain number of years (please correct me if you know otherwise!). So my understanding is that if I set a 6-month TTL on 20 years of data, I might delete records starting at 6 months old and going back maybe 3-5 years, but then there'd be a whole lot of really old data left over, unaffected by the TTL (again, please correct me if you know otherwise!). So I guess I'm looking for:
The ability to do a manual, one-time deletion of data older than 6 months old; and
The ability to set a 6 month TTL moving forward
For the first one, I need to execute something like DELETE FROM mytable WHERE created < '2018-06-25'; however, I can't figure out how to do this from the AWS/DynamoDB management console. Any ideas?
For the second part, when I go to Manage TTL in the DynamoDB console:
I'm not actually seeing where I would set the 6-month expiry. Is it the date/time fields at the very bottom of that dialog?! That seems strange to me... if that were the case then the TTL wouldn't be a rolling 6-month window, it would just be a hardcoded point in time which I'd need to keep updating manually so that data is never more than 6 months old...
You are correct about how far back in time TTL goes; it's actually 5 years. The way it works is by comparing your TTL attribute value with the current timestamp. If your item has a TTL timestamp that is older than the current timestamp, it's scheduled for deletion within the next 48 hours (it's not immediate). So, if you use the creation timestamp of the item as the TTL, everything will be scheduled for deletion as soon as you insert it, and that's not what you want.
The way you manage the 6-month expiry policy is in your application. When you create an item, set a TTL attribute to a timestamp 6 months ahead of the creation time and just leave it there. Dynamo will take care of deleting it in 6 months. For your "legacy" data, I can't see a way around querying and looping through each item and setting the TTL for each of them manually.
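A rough boto3 sketch of both steps (the table name mytable, key id, created epoch-seconds attribute, and TTL attribute name ttl are all assumptions; the TTL attribute name must match whatever you enable under Manage TTL):

```python
import time
import boto3

SIX_MONTHS = 182 * 24 * 60 * 60  # roughly six months, in seconds

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("mytable")  # placeholder table name

# New items: write the TTL attribute 6 months ahead of the creation time.
now = int(time.time())
table.put_item(
    Item={
        "id": "example-id",       # placeholder partition key
        "created": now,           # creation time, epoch seconds
        "ttl": now + SIX_MONTHS,  # must match the configured TTL attribute name
    }
)

# Legacy items: scan the table and backfill the TTL attribute one item at a time.
scan_kwargs = {}
while True:
    page = table.scan(**scan_kwargs)
    for item in page["Items"]:
        table.update_item(
            Key={"id": item["id"]},
            UpdateExpression="SET #t = :ttl",
            ExpressionAttributeNames={"#t": "ttl"},
            ExpressionAttributeValues={":ttl": int(item["created"]) + SIX_MONTHS},
        )
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```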
Deleting old records directly or updating their TTL so they can be deleted later by DynamoDB both require the same write capacity. You’ll need to scan / query and delete records one-by-one.
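For the one-time deletion, a hedged sketch of that scan-and-delete loop (again assuming a hypothetical table mytable with partition key id and a numeric created epoch-seconds attribute):

```python
import time
import boto3
from boto3.dynamodb.conditions import Attr

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("mytable")  # placeholder table name

cutoff = int(time.time()) - 182 * 24 * 60 * 60  # roughly six months ago

# The filter is applied after the read, so the scan still consumes read
# capacity for every item; batch_writer batches and retries the deletes.
scan_kwargs = {"FilterExpression": Attr("created").lt(cutoff)}
with table.batch_writer() as batch:
    while True:
        page = table.scan(**scan_kwargs)
        for item in page["Items"]:
            batch.delete_item(Key={"id": item["id"]})
        if "LastEvaluatedKey" not in page:
            break
        scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```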
If you have, let's say, 90% old data, the most cost- and time-efficient way of deleting it is to move the remaining 10% to a new table and delete the old one.
Another non-standard way I see is to choose an existing timestamp field you can sacrifice (for instance, an audit field such as the creation date), remove it from new records, and use it as the TTL to delete the old ones. This lets you do what you need more cheaply and without switching to another table, which may require multi-step changes in your application, but it requires the field to (a) not be in use, (b) be in the past, and (c) be a UNIX timestamp. If you don't want to lose it permanently, you may copy it to another attribute and copy it back after all old records have been deleted and TTL on that field has been turned off (or switched to another attribute). It will not work for records with a timestamp more than 5 years in the past.
As far as I know, the underlying files (columnar format) are immutable. My question is: if the files are immutable, how are updates performed? Does Snowflake maintain different versions of the same row and return the latest version based on a key? Or does it insert the data into new files behind the scenes and delete the old files? How does performance get affected in these scenarios (querying current data) if Time Travel is set to 90 days, since Snowflake needs to maintain different versions of the same row? And as Snowflake doesn't enforce keys, how are the different versions even detected? Any insights (documents/videos) on the detailed internals are appreciated.
It's a complex question, but the basic ideas are as follows (quite a bit simplified):
records are stored in immutable micro-partitions on S3
a table is a list of micro-partitions
when a record is modified
its old micro-partition is marked as inactive (from that moment),
a new micro-partition is created, containing the modified record along with the other records from that micro-partition,
the new micro-partition is added to the table's list (marked as active from that moment)
inactive micro-partitions are not deleted for some time, allowing time-travel
So Snowflake doesn't need a record key, as each record is stored in only one micro-partition that is active at any given time.
The impact of updates on query performance is marginal; the only visible impact might be that the new files need to be fetched from S3 and cached on the warehouses.
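A toy Python sketch of that idea, purely illustrative and not Snowflake's actual implementation: a table is just a list of immutable partition objects, and an update retires one active partition and adds a rewritten copy instead of mutating it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class MicroPartition:
    """Immutable bundle of records (a stand-in for a columnar file on S3)."""
    records: Dict[str, str]  # record id -> value, never changed once written


@dataclass
class Table:
    active: List[MicroPartition] = field(default_factory=list)
    inactive: List[MicroPartition] = field(default_factory=list)  # kept for time travel

    def update(self, record_id: str, new_value: str) -> None:
        # Find the active partition holding the record, mark it inactive,
        # and add a new partition carrying the other records unchanged.
        for part in self.active:
            if record_id in part.records:
                self.active.remove(part)
                self.inactive.append(part)
                self.active.append(MicroPartition({**part.records, record_id: new_value}))
                return
        raise KeyError(record_id)


t = Table(active=[MicroPartition({"r1": "a", "r2": "b"})])
t.update("r1", "a2")
print([p.records for p in t.active])    # [{'r1': 'a2', 'r2': 'b'}]
print([p.records for p in t.inactive])  # [{'r1': 'a', 'r2': 'b'}]
```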
For more info, I'd suggest going to Snowflake forums and asking there.