Is there any way to move less frequently accessed S3 buckets to Glacier automatically? In other words, is there an option or service that searches S3 by last access date and then assigns a lifecycle policy to the results, so they can be moved to Glacier? Or do I have to write a program to do this? If that isn't possible, is there any way to assign a lifecycle policy to all of my buckets at once?
Looking for some feedback. Thank you.
No, this isn't possible as a ready-made feature. However, there is something that might help: Amazon S3 Analytics.
This produces a report of which items in your buckets are less frequently used. That information can be used to find items that should be archived.
It should be possible to use the S3 Analytics output as input for a script that tags items for archiving. However, the complete feature (find infrequently used items and then archive them) doesn't seem to be available as a standard product.
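As a rough illustration of that script idea (not a standard feature), here is a minimal boto3 sketch. It assumes you have exported an S3 Analytics or S3 Inventory report as a CSV with a "Key" column, and that a separate tag-filtered lifecycle rule transitions anything tagged archive=true to Glacier; the bucket names, report key, and column name are placeholders.

    import csv
    import boto3

    s3 = boto3.client("s3")

    # Placeholders: the bucket holding the exported report, the report object key,
    # and the data bucket whose objects we want to archive.
    REPORT_BUCKET = "my-analytics-reports"        # assumption
    REPORT_KEY = "reports/infrequent-access.csv"  # assumption
    DATA_BUCKET = "my-data-bucket"                # assumption

    # Download the report and read it as CSV.
    body = s3.get_object(Bucket=REPORT_BUCKET, Key=REPORT_KEY)["Body"]
    rows = csv.DictReader(line.decode("utf-8") for line in body.iter_lines())

    for row in rows:
        key = row["Key"]  # assumes the report has a "Key" column
        # Tag the object so a tag-filtered lifecycle rule can transition it to Glacier.
        s3.put_object_tagging(
            Bucket=DATA_BUCKET,
            Key=key,
            Tagging={"TagSet": [{"Key": "archive", "Value": "true"}]},
        )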
You can do this by adding a tag or prefix to the objects in your buckets.
Create a lifecycle rule that targets that tag or prefix to group the objects together, then apply the same lifecycle policy to each bucket.
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html
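To cover the "all the buckets at once" part: lifecycle configurations are applied per bucket, but you can loop over every bucket and put the same configuration on each one. A rough boto3 sketch, assuming a tag of archive=true marks the objects to transition (the tag name and 30-day transition age are just examples):

    import boto3

    s3 = boto3.client("s3")

    # Example rule: transition objects tagged archive=true to Glacier after 30 days.
    lifecycle = {
        "Rules": [
            {
                "ID": "archive-tagged-objects",  # example rule name
                "Filter": {"Tag": {"Key": "archive", "Value": "true"}},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    }

    # Apply the same configuration to every bucket in the account.
    # Note: this replaces any existing lifecycle configuration on each bucket.
    for bucket in s3.list_buckets()["Buckets"]:
        s3.put_bucket_lifecycle_configuration(
            Bucket=bucket["Name"],
            LifecycleConfiguration=lifecycle,
        )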
We need to implement an expiration of X days of all customer data due to contractual obligations. Not too big of a deal, that's about as easy as it gets.
But at the same time, some customers' projects have files with metadata. Perhaps dataset definitions which most definitely DO NOT need to go away. We have free rein to tag or manipulate any of the data in any way we see fit. Since we have 500+ S3 buckets, we need a somewhat global solution.
Ideally, we would simply set an expiration on the bucket and another rule for the metadata/ prefix. Except then we have a rule overlap and metadata/* files will still get the X day expiration that's been applied to the entire bucket.
We can forcefully tag all objects NOT in metadata/* with something like allow_expiration = true using Lambda. While not out of the question, I would like to implement something a little more built-in with S3.
I don't think there's a way to implement what I'm after without using some kind of tagging and external script. Thoughts?
If you've got a free hand on tagging the objects, you could use a prefix and/or a tag filter with S3 lifecycle rules.
You can filter objects by key prefix, object tags, or a combination of both (in which case Amazon S3 uses a logical AND to combine the filters).
See Lifecycle Filter Rules
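As a concrete, hypothetical boto3 sketch of your own proposal: one rule expires only objects carrying the allow_expiration=true tag, so anything under metadata/ that is left untagged is never touched. The bucket name, tag, and retention period are placeholders.

    import boto3

    s3 = boto3.client("s3")

    # Hypothetical values: bucket name and retention period.
    BUCKET = "customer-data-bucket"
    RETENTION_DAYS = 90

    s3.put_bucket_lifecycle_configuration(
        Bucket=BUCKET,
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-tagged-customer-data",
                    # Only objects tagged allow_expiration=true match this rule,
                    # so untagged metadata/ objects are kept.
                    "Filter": {"Tag": {"Key": "allow_expiration", "Value": "true"}},
                    # To combine a prefix with the tag (logical AND), use instead:
                    # "Filter": {"And": {"Prefix": "projects/",
                    #                    "Tags": [{"Key": "allow_expiration", "Value": "true"}]}},
                    "Status": "Enabled",
                    "Expiration": {"Days": RETENTION_DAYS},
                }
            ]
        },
    )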
You could automate the creation and management of your lifecycle rules with IaC, for example Terraform.
See S3 Bucket Lifecycle Configuration with Terraform
There's a useful blog on how to manage these dynamically here.
What's more, using tags has a number of additional benefits:
- Object tags enable fine-grained access control of permissions. For example, you could grant an IAM user permissions to read only objects with specific tags.
- Object tags enable fine-grained object lifecycle management in which you can specify a tag-based filter, in addition to a key name prefix, in a lifecycle rule.
- When using Amazon S3 analytics, you can configure filters to group objects together for analysis by object tags, by key name prefix, or by both prefix and tags.
- You can also customize Amazon CloudWatch metrics to display information by specific tag filters.
Source, and more on how to set multiple tags on an Amazon S3 object with a single request.
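For illustration, that single request is PutObjectTagging, which replaces an object's entire tag set in one call. A small boto3 example; the bucket, key, and tag values are placeholders:

    import boto3

    s3 = boto3.client("s3")

    # Replaces the whole tag set of one object in a single request.
    # Bucket, key, and tags below are placeholders.
    s3.put_object_tagging(
        Bucket="my-bucket",
        Key="projects/report.csv",
        Tagging={
            "TagSet": [
                {"Key": "allow_expiration", "Value": "true"},
                {"Key": "team", "Value": "analytics"},
            ]
        },
    )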
Is there a way similar to an S3 lifecycle policy (or just an s3 bucket policy) that will automatically delete objects in a bucket older than x days and with a given file extension?
Depending on the extension it might be a bucket wide delete action or only delete objects under certain prefixes.
with a given file extension
Sadly, you can't do this with S3 lifecycle rules alone, as they only filter on key prefix, not on a suffix such as a file extension.
This means you need a custom solution. Depending on the exact nature of the problem (number of files, how frequently you want to perform the deletion), there are several ways to do this, including running a single Lambda on a schedule, S3 Batch Operations, using DynamoDB to store the metadata, and so on. A rough sketch of the scheduled-Lambda approach follows.
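This sketch assumes boto3 and a standard lambda_handler entry point; the bucket name, extension, and age threshold are placeholders, so treat it as an illustration rather than a drop-in solution.

    from datetime import datetime, timedelta, timezone
    import boto3

    s3 = boto3.client("s3")

    # Placeholders: target bucket, extension to match, and age threshold.
    BUCKET = "my-bucket"
    EXTENSION = ".tmp"
    MAX_AGE_DAYS = 30

    def lambda_handler(event, context):
        cutoff = datetime.now(timezone.utc) - timedelta(days=MAX_AGE_DAYS)
        to_delete = []

        # Walk the whole bucket (add Prefix=... to restrict to certain prefixes).
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET):
            for obj in page.get("Contents", []):
                if obj["Key"].endswith(EXTENSION) and obj["LastModified"] < cutoff:
                    to_delete.append({"Key": obj["Key"]})

        # delete_objects accepts up to 1000 keys per request.
        for i in range(0, len(to_delete), 1000):
            s3.delete_objects(
                Bucket=BUCKET,
                Delete={"Objects": to_delete[i:i + 1000], "Quiet": True},
            )

        return {"deleted": len(to_delete)}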
If you are uploading the files yourself (PutObject, etc.), you can tag your objects at upload time and then use that tag to delete them with an S3 lifecycle rule.
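For example, with boto3 the tag can be attached at upload time via the Tagging parameter of put_object (a URL-encoded key=value string). The names below are placeholders:

    import boto3

    s3 = boto3.client("s3")

    # Tag the object as it is uploaded; a lifecycle rule filtered on
    # cleanup=true (or similar) can then expire it after N days.
    s3.put_object(
        Bucket="my-bucket",        # placeholder
        Key="exports/data.tmp",    # placeholder
        Body=b"...",
        Tagging="cleanup=true",    # URL-encoded key=value pairs
    )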
I just want to know the best strategy for backing up files stored in an S3 bucket. I can think of 2 options: enabling versioning, and (periodically, e.g. once a day) syncing to a new S3 bucket. The files are created by Athena CTAS queries every day and the file names are randomly generated. If I delete files by accident, I need to restore them from the backup. Some advantages of having another S3 bucket are that it protects against accidental deletion of the original bucket itself, and that restoring deleted file(s) is easy. On the other hand, versioning looks simpler and is usually preferred. I could not find any articles discussing the pros and cons of these 2 approaches, hence this question. I just want to know the pros and cons of each approach.
Thanks,
Sree
You typically do not need to "back up" Amazon S3 objects because they are replicated across multiple data centers. However, as you point out, you might want a method to handle "accidental deletion".
One option is to prevent deletion by using Object Locking or a Bucket Policy. If people aren't permitted to delete objects, then there is no need for a backup.
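As an illustration of the bucket-policy route, here is a hedged boto3 sketch that denies s3:DeleteObject to everyone except a named admin role. The bucket name and role ARN are placeholders, and you would adapt the condition to your own account setup.

    import json
    import boto3

    s3 = boto3.client("s3")

    BUCKET = "my-backup-bucket"  # placeholder
    ADMIN_ROLE_ARN = "arn:aws:iam::123456789012:role/s3-admin"  # placeholder

    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyObjectDeletion",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:DeleteObject",
                "Resource": f"arn:aws:s3:::{BUCKET}/*",
                # Everyone except the admin role is denied deletes.
                "Condition": {"StringNotEquals": {"aws:PrincipalArn": ADMIN_ROLE_ARN}},
            }
        ],
    }

    s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))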
Of the two options you present, Versioning is a better option because you are not duplicating objects. This means that you are not paying twice for storage. It is also simpler because there is no need to configure replication.
Buckets cannot be deleted unless they are empty.
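To make the versioning option concrete, here is a minimal boto3 sketch that enables versioning and then "undeletes" an object by removing its delete marker; the bucket and key are placeholders.

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-athena-results"    # placeholder
    KEY = "ctas/output/part-0000"   # placeholder

    # Turn on versioning for the bucket (one-time setup).
    s3.put_bucket_versioning(
        Bucket=BUCKET,
        VersioningConfiguration={"Status": "Enabled"},
    )

    # To restore an accidentally deleted object, remove its delete marker:
    versions = s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)
    for marker in versions.get("DeleteMarkers", []):
        if marker["Key"] == KEY and marker["IsLatest"]:
            s3.delete_object(Bucket=BUCKET, Key=KEY, VersionId=marker["VersionId"])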
Is it better to have multiple S3 buckets per category of uploads, or one bucket with sub-folders, OR a linked S3 bucket? I know for sure there will be more user-images than profile-pics, and that there is a 5TB limit per bucket and 100 buckets per account. I'm doing this using the aws boto library and https://github.com/amol-/depot
Which of the following structures should my folders have?
/app_bucket
/profile-pic-folder
/user-images-folder
OR
profile-pic-bucket
user-images-bucket
OR
/app_bucket_1
/app_bucket_2
The last one implies that it's really a 10TB bucket where a new bucket is created when the files within bucket_1 exceed 5TB, but all uploads are read as if they were in one bucket. Or is there a better way of doing what I'm trying to do? Many thanks!
I'm not sure if this is correct... 100 buckets per account?
https://www.reddit.com/r/aws/comments/28vbjs/requesting_increase_in_number_of_s3_buckets/
Yes, there really is a 100-bucket limit per account. I asked an architect at an AWS event about the reason for it. He said it is to stop people from hosting unlimited static websites on S3, as they think this could be abused. But you can apply for an increase.
By default, you can create up to 100 buckets in each of your AWS accounts. If you need additional buckets, you can increase your bucket limit by submitting a service limit increase.
Source: http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html
Also, please note that there are actually no folders in S3, just a flat file structure:
Amazon S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects. Amazon S3 does this by using key name prefixes for objects.
Source: http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
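In practice, the "folders" are just key prefixes, which you can see by listing with a delimiter. A small boto3 illustration (bucket and prefix are placeholders):

    import boto3

    s3 = boto3.client("s3")

    # "Folders" are emulated with key prefixes and a delimiter.
    resp = s3.list_objects_v2(
        Bucket="app_bucket",           # placeholder
        Prefix="profile-pic-folder/",  # placeholder "folder"
        Delimiter="/",
    )

    for prefix in resp.get("CommonPrefixes", []):   # sub-"folders"
        print("folder:", prefix["Prefix"])
    for obj in resp.get("Contents", []):            # objects in this "folder"
        print("object:", obj["Key"])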
Finally, the 5TB limit only applies to a single object. There is no limit on the number of objects or total size of the bucket.
Q: How much data can I store?
The total volume of data and number of objects you can store are unlimited.
Source: https://aws.amazon.com/s3/faqs/
Also, the documentation states there is no performance difference between using a single bucket or multiple buckets, so I guess both option 1 and option 2 would be suitable for you.
Hope this helps.
Simpler Permissions with Multiple Buckets
If the images are used in different use cases, using multiple buckets will simplify the permissions model, since you can give clients/users bucket-level permissions instead of directory-level permissions.
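For instance, the grant for a read-only image-serving client can then be a single bucket-scoped statement instead of per-prefix statements in a shared bucket. A purely illustrative boto3 sketch; the bucket name, role name, and policy name are placeholders.

    import json
    import boto3

    iam = boto3.client("iam")

    # Illustrative only: read-only access scoped to one whole bucket.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    "arn:aws:s3:::user-images-bucket",    # placeholder bucket
                    "arn:aws:s3:::user-images-bucket/*",
                ],
            }
        ],
    }

    iam.put_role_policy(
        RoleName="image-reader",        # placeholder role
        PolicyName="read-user-images",  # placeholder policy name
        PolicyDocument=json.dumps(policy),
    )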
2-way doors and migrations
On a similar note, using 2 buckets is more flexible down the road.
1 to 2:
If you switch from 1 bucket to 2, you now have to move all clients to the new set-up. You will need to update permissions for all clients, which can require IAM policy changes for both you and the client. Then you can move your clients over by releasing a new client library during the transition period.
2 to 1:
If you switch from 2 buckets to 1 bucket, your clients will already have access to the 1 bucket. All you need to do is update the client library and move your clients onto it during the transition period.
*If you don't have a client library, then code changes are required in both cases for the clients.
According to the documentation objects can only be deleted permanently by also supplying their version number.
I had a look at Python's Boto and it seems simple enough for small sets of objects. But if I have a folder that contains 100 000 objects, it would have to delete them one by one and that would take some time.
Is there a better way to go about this?
An easy way to delete versioned objects in an Amazon S3 bucket is to create a lifecycle rule. The rule runs as a daily batch (around midnight UTC), can delete objects under specified paths, and knows how to handle versioned objects.
See:
Lifecycle Configuration for a Bucket with Versioning
Such deletions do not count towards the API call usage count, so it can be cheaper, too!
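A hedged boto3 sketch of such a rule: it expires current versions under a prefix and permanently removes noncurrent versions after a short grace period. The bucket name, prefix, and day counts are placeholders.

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="my-versioned-bucket",  # placeholder
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-old-folder",
                    "Filter": {"Prefix": "old-folder/"},  # placeholder path
                    "Status": "Enabled",
                    # Adds a delete marker to each current version after 1 day.
                    "Expiration": {"Days": 1},
                    # Permanently removes noncurrent versions after 1 day.
                    "NoncurrentVersionExpiration": {"NoncurrentDays": 1},
                }
            ]
        },
    )

    # Any leftover delete markers can later be cleaned up with a rule whose
    # Expiration is {"ExpiredObjectDeleteMarker": True}.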