According to the documentation, objects can only be deleted permanently by also supplying their version number.
I had a look at Python's Boto and it seems simple enough for small sets of objects. But if I have a folder that contains 100,000 objects, it would have to delete them one by one, and that would take some time.
Is there a better way to go about this?
An easy way to delete versioned objects in an Amazon S3 bucket is to create a lifecycle rule. The rule runs on a batch basis (around midnight UTC?), can delete objects within specified paths, and knows how to handle versioned objects.
See:
Lifecycle Configuration for a Bucket with Versioning
Such deletions do not count towards your API request usage, so it can be cheaper, too!
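For reference, such a rule can also be set up programmatically. A minimal boto3 sketch (the bucket name, prefix, and day counts are placeholders, not anything from the question):

```python
import boto3

s3 = boto3.client("s3")

# Illustrative bucket name and prefix; tune the day counts to your retention needs.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-versioned-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "purge-old-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": "uploads/"},
                # Current versions receive a delete marker after 1 day...
                "Expiration": {"Days": 1},
                # ...and noncurrent versions are permanently deleted 1 day
                # after they become noncurrent.
                "NoncurrentVersionExpiration": {"NoncurrentDays": 1},
            }
        ]
    },
)
```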
Related
Is there a way, similar to an S3 lifecycle policy (or just an S3 bucket policy), to automatically delete objects in a bucket that are older than x days and have a given file extension?
Depending on the extension it might be a bucket wide delete action or only delete objects under certain prefixes.
with a given file extension
Sadly you can't do this with S3 lifecycles, as they only work based on a prefix, not a suffix such as an extension.
This means you need a custom solution for that problem. Depending on the exact nature of the issue (number of files, how frequently you want to perform the deletion), there are several ways of doing this, including running a single Lambda function on a schedule, S3 Batch Operations, using DynamoDB to store the metadata, and so on.
If you are uploading your files via the S3 API (PutObject, etc.), you can tag your objects and then use the tag to delete them with an S3 lifecycle rule.
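A rough boto3 sketch of that approach (the bucket name, key, tag key/value, and retention period here are purely illustrative): tag the object at upload time, then attach a lifecycle rule whose filter matches that tag.

```python
import boto3

s3 = boto3.client("s3")

# Tag the object as it is uploaded (placeholder bucket, key, and tag).
s3.put_object(
    Bucket="my-bucket",
    Key="exports/report.csv",
    Body=b"...",
    Tagging="expire=true",  # URL-encoded key=value pairs
)

# Lifecycle rule that expires only objects carrying that tag.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-tagged-objects",
                "Status": "Enabled",
                "Filter": {"Tag": {"Key": "expire", "Value": "true"}},
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```

The lifecycle filter can also combine the tag with a prefix (via an And element) if the rule should only apply under certain paths.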
I just want to know the best strategy for backing up the files stored in an S3 bucket. I can think of 2 options: enabling versioning, or (periodically, e.g. once a day) syncing to a new S3 bucket. The files are created by Athena CTAS queries every day and the file names are randomly generated. If I delete the files by accident, I need to restore them from the backup.
Some advantages of having another S3 bucket are that it protects against accidental deletion of the original S3 bucket itself, and that restoring deleted file(s) is easy. On the other hand, versioning looks simple and seems the preferred approach. I could not find any articles talking about the pros and cons of these 2 approaches, hence this question/debate. I just want to know the pros and cons of each approach.
Thanks,
Sree
You typically do not need to "back up" Amazon S3 objects because they are replicated in multiple data centers. However, as you point out, you might want a method to handle "accidental deletion".
One option is to prevent deletion by using Object Locking or a Bucket Policy. If people aren't permitted to delete objects, then there is no need for a backup.
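A sketch of the bucket-policy option (the bucket name and the exempted admin role ARN are placeholders): deny deletion of objects, and of the bucket itself, to every principal except one admin role.

```python
import json
import boto3

s3 = boto3.client("s3")

# Illustrative policy: only the named admin role may delete objects or the bucket.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyDeletes",
            "Effect": "Deny",
            "Principal": "*",
            "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion", "s3:DeleteBucket"],
            "Resource": [
                "arn:aws:s3:::my-athena-results",
                "arn:aws:s3:::my-athena-results/*",
            ],
            "Condition": {
                "ArnNotEquals": {
                    "aws:PrincipalArn": "arn:aws:iam::123456789012:role/s3-admin"
                }
            },
        }
    ],
}

s3.put_bucket_policy(Bucket="my-athena-results", Policy=json.dumps(policy))
```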
Of the two options you present, Versioning is a better option because you are not duplicating objects. This means that you are not paying twice for storage. It is also simpler because there is no need to configure replication.
Buckets cannot be deleted unless they are empty.
Looking at https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html,
Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all regions with one caveat. The caveat is that if you make a HEAD or GET request to the key name (to find if the object exists) before creating the object, Amazon S3 provides eventual consistency for read-after-write.
My understanding from this is that if I create a new object and I haven't checked for its existence beforehand, it should be available immediately (e.g., show up in list requests).
But the above link also says
...you might observe the following behaviors:
A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.
which seems to contradict the first statement, as it basically says that read-after-write consistency is always eventual for PUTs.
I read it as:
Amazon guarantees that a read request (GetObject or HeadObject) will succeed (read-after-write consistency) for any newly PUT object, assuming you haven't requested the object before
Amazon doesn't guarantee that a ListBucket request will be immediately consistent for any newly PUT object, but rather the new object will eventually show up in a ListBucket request (eventual consistency)
S3 is now strongly consistent: all S3 GET, PUT, and LIST operations, as well as operations that change object tags, ACLs, or metadata, are now strongly consistent. What you write is what you will read, and the results of a LIST will be an accurate reflection of what's in the bucket. This applies to all existing and new S3 objects, works in all regions, and is available to you at no extra charge! There's no impact on performance, you can update an object hundreds of times per second if you'd like, and there are no global dependencies.
https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
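In practical terms, a read or list issued straight after a successful write now reflects that write. A minimal boto3 illustration (bucket and key names are placeholders):

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "demo/hello.txt"  # placeholders

s3.put_object(Bucket=bucket, Key=key, Body=b"hello")

# With strong read-after-write consistency, both calls reflect the PUT immediately.
obj = s3.get_object(Bucket=bucket, Key=key)
print(obj["Body"].read())  # b'hello'

listing = s3.list_objects_v2(Bucket=bucket, Prefix="demo/")
print([o["Key"] for o in listing.get("Contents", [])])  # includes 'demo/hello.txt'
```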
We are using a versioned S3 bucket for our use case.
We'll be frequently updating the same file.
I would like to know how many versions of the same file the S3 bucket can handle.
I wonder whether the oldest version will be removed if there is a limit to the maximum number of versions that a versioned S3 bucket can handle.
There are no easily reachable limits on the number of versions, but you'll be charged for each stored version.
https://aws.amazon.com/blogs/aws/amazon-s3-enhancement-versioning/
Normal S3 pricing applies to each version of an object. You can store any number of versions of the same object, so you may want to implement some expiration and deletion logic if you plan to make use of this feature.
So if you are going to update the file frequently, you should consider setting up an S3 Lifecycle rule in advance.
https://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html
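If you go that route, here is a hedged boto3 sketch (placeholder bucket name; the counts are arbitrary) that keeps only the few most recent noncurrent versions and expires the rest (the NewerNoncurrentVersions element requires a reasonably recent SDK version):

```python
import boto3

s3 = boto3.client("s3")

# Illustrative rule: keep the 5 most recent noncurrent versions of each object
# and permanently delete older ones 30 days after they become noncurrent.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-versioned-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "trim-old-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix = whole bucket
                "NoncurrentVersionExpiration": {
                    "NoncurrentDays": 30,
                    "NewerNoncurrentVersions": 5,
                },
            }
        ]
    },
)
```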
Is there any way to move less frequently accessed S3 buckets to Glacier automatically? I mean to say, is there some option or service that searches S3 for the least recent access date and then assigns a lifecycle policy to those buckets, so they can be moved to Glacier? Or do I have to write a program to do this? If this is not possible, is there any way to assign a lifecycle policy to all the buckets at once?
Looking for some feedback. Thank you.
No, this isn't possible as a ready-made feature. However, there is something that might help: Amazon S3 Analytics.
This produces a report of which items in your buckets are less frequently used. This information can be used to find items that should be archived.
It could be possible to use the S3 Analytics output as input for a script to tag items for archiving. However, the complete feature (find infrequently used items and then archive them) doesn't seem to be available as a standard product.
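If you did build such a script, the analysis itself can be enabled with boto3. A sketch (bucket names and the configuration ID are placeholders); the exported CSV would then feed your tagging/archiving logic:

```python
import boto3

s3 = boto3.client("s3")

# Enable storage class analysis on a bucket and export the daily report
# to a separate (placeholder) reporting bucket.
s3.put_bucket_analytics_configuration(
    Bucket="my-data-bucket",
    Id="access-pattern-report",
    AnalyticsConfiguration={
        "Id": "access-pattern-report",
        "StorageClassAnalysis": {
            "DataExport": {
                "OutputSchemaVersion": "V_1",
                "Destination": {
                    "S3BucketDestination": {
                        "Format": "CSV",
                        "Bucket": "arn:aws:s3:::my-analytics-reports",
                        "Prefix": "s3-analytics/",
                    }
                },
            }
        },
    },
)
```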
You can do this by adding a tag or prefix to your objects.
Create a lifecycle rule that targets that tag or prefix to group the objects together, and assign/apply a single lifecycle policy (lifecycle rules are configured per bucket, so you would add the same rule to each bucket).
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html
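A minimal boto3 sketch of such a rule (the tag key/value, bucket name, and day count are illustrative); it transitions tagged objects to Glacier, and you would apply the same rule to each bucket you want covered:

```python
import boto3

s3 = boto3.client("s3")

# Illustrative rule: objects tagged archive=true move to Glacier 90 days after creation.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-tagged-objects",
                "Status": "Enabled",
                "Filter": {"Tag": {"Key": "archive", "Value": "true"}},
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```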