I just want to know the best strategy for backing up files stored in an S3 bucket. I can think of two options: enabling versioning, or periodically (e.g. once a day) syncing to a new S3 bucket. The files are created by Athena CTAS queries every day and the file names are randomly generated. If I delete the files by accident, I need to restore them from the backup. Some advantages of having another S3 bucket are that it protects against accidental deletion of the original bucket itself, and that restoring deleted file(s) is easy. On the other hand, versioning looks simpler and seems to be the preferred approach. I could not find any articles discussing the pros and cons of these two approaches, hence this question/debate. I just want to know the pros and cons of each approach.
Thanks,
Sree
You typically do not need to "backup" Amazon S3 objects because they are automatically replicated across multiple data centers. However, as you point out, you might want a method to handle accidental deletion.
One option is to prevent deletion by using S3 Object Lock or a bucket policy. If people aren't permitted to delete objects, then there is no need for a backup.
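As a rough illustration of the bucket policy idea (the bucket name is a placeholder, and in practice you would scope the Principal and add Conditions rather than denying everyone), a deny-delete policy could be applied with boto3 along these lines:

    import json
    import boto3

    s3 = boto3.client("s3")

    # Hypothetical bucket name; denies object deletion to everyone.
    # An account administrator can still remove this policy, so it is a
    # guard rail, not absolute protection like Object Lock compliance mode.
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyObjectDeletion",
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": "arn:aws:s3:::example-bucket/*",
            }
        ],
    }

    s3.put_bucket_policy(Bucket="example-bucket", Policy=json.dumps(policy))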
Of the two options you present, versioning is the better choice because you are not duplicating objects, which means you are not paying twice for storage. It is also simpler because there is no need to configure replication or a sync job.
Buckets cannot be deleted unless they are empty.
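If it helps, here is a rough boto3 sketch (bucket and key names are placeholders) of enabling versioning and of "undeleting" an object by removing its delete marker, which is how a restore typically works with versioning:

    import boto3

    s3 = boto3.client("s3")
    bucket = "example-bucket"   # placeholder bucket name
    key = "path/to/object.csv"  # placeholder key

    # Turn versioning on for the bucket.
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

    # "Restore" an accidentally deleted object by removing its delete marker.
    versions = s3.list_object_versions(Bucket=bucket, Prefix=key)
    for marker in versions.get("DeleteMarkers", []):
        if marker["Key"] == key and marker["IsLatest"]:
            s3.delete_object(Bucket=bucket, Key=key, VersionId=marker["VersionId"])
            # The most recent remaining version becomes current again.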
Related
I have an S3 bucket that is 9TB and I want to copy it over to another AWS account.
What would be the fastest and most cost efficient way to copy it?
I know I can rsync them and also use S3 replication.
I think rsync will take too long and might be a bit pricey.
I have not played with S3 replication so I am not sure of its speed and cost.
Are there any other methods that I might not be aware of?
FYI - The source and destination buckets will be in the same region (but different accounts).
There is no quicker way to do it than using sync, and I do not believe it is that pricey. You do not mention the number of files you are copying, though.
You will pay $0.004 per 10,000 requests for the GET operations on the files you are copying and $0.005 per 1,000 requests for the PUT operations on the files you are writing. For example, at those rates one million objects would cost roughly $0.40 in GETs and $5.00 in PUTs. Also, I believe you won't pay data transfer costs since this is within the same region.
If you want to speed this up, you could run multiple sync jobs in parallel if the bucket can be logically divided, e.g. s3://examplebucket/job1 and s3://examplebucket/job2 (a rough sketch of this follows).
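For example, something like the following could fan out one sync job per prefix. This is only a sketch: the bucket names and prefixes are placeholders, and it assumes the AWS CLI is installed and has credentials for both buckets.

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # Placeholder prefixes; split along whatever structure your bucket has.
    prefixes = ["job1/", "job2/", "job3/"]
    source = "s3://examplebucket"
    destination = "s3://example-destination-bucket"

    def sync_prefix(prefix):
        # Each call shells out to the AWS CLI and syncs one prefix.
        return subprocess.run(
            ["aws", "s3", "sync", f"{source}/{prefix}", f"{destination}/{prefix}"],
            check=True,
        )

    # Run the sync jobs concurrently instead of one after another.
    with ThreadPoolExecutor(max_workers=len(prefixes)) as pool:
        list(pool.map(sync_prefix, prefixes))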
You can use S3 Batch Operations to copy large quantities of objects between buckets in the same region.
It can accept a CSV file containing a list of objects, or you can use the output of Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.
While copying, it can also update tags, metadata and ACLs.
See: Cross-account bulk transfer of files using Amazon S3 Batch Operations | AWS Storage Blog
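For reference, a Batch Operations copy job can be created with boto3 (the s3control client). This is only a sketch under assumed inputs: the account ID, role ARN, bucket ARNs, and manifest ETag are placeholders you would replace with your own values.

    import boto3

    s3control = boto3.client("s3control")

    # All IDs and ARNs below are placeholders.
    response = s3control.create_job(
        AccountId="111122223333",
        ConfirmationRequired=False,
        Priority=10,
        RoleArn="arn:aws:iam::111122223333:role/batch-ops-copy-role",
        Operation={
            "S3PutObjectCopy": {
                "TargetResource": "arn:aws:s3:::example-destination-bucket",
            }
        },
        Manifest={
            "Spec": {
                "Format": "S3BatchOperations_CSV_20180820",
                "Fields": ["Bucket", "Key"],
            },
            "Location": {
                "ObjectArn": "arn:aws:s3:::example-manifest-bucket/manifest.csv",
                "ETag": "manifest-object-etag",
            },
        },
        Report={
            "Bucket": "arn:aws:s3:::example-report-bucket",
            "Format": "Report_CSV_20180820",
            "Enabled": True,
            "Prefix": "batch-copy-reports",
            "ReportScope": "AllTasks",
        },
    )
    print("Created job:", response["JobId"])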
I wound up finding the page below and used replication with the copy to itself method.
https://aws.amazon.com/premiumsupport/knowledge-center/s3-large-transfer-between-buckets/
Does S3 have snapshots? How should I solve a problem where something, for example, deletes all my S3 data? How do I back it up?
There are a few options.
1. Enable versioning on your bucket. Every version of the objects will be retained. Deleting an object will just add a "delete marker" to indicate the object was deleted. You will pay for the storage of all the versions. Note that versions can also be deleted.
2. If you are just worried about deletion, you can add a bucket policy to prevent deletion. You can also use some of the newer S3 Object Lock hold options.
3. You can use cross region replication to copy the objects to a bucket in a different region and optionally a different account.
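Option 3 can be set up with boto3 roughly as below. This is only a sketch: the role ARN, bucket names, and destination account are placeholders, and note that replication requires versioning to be enabled on both the source and destination buckets.

    import boto3

    s3 = boto3.client("s3")

    # Placeholder names/ARNs; versioning must already be enabled on both buckets.
    s3.put_bucket_replication(
        Bucket="example-source-bucket",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::111122223333:role/replication-role",
            "Rules": [
                {
                    "ID": "replicate-everything",
                    "Priority": 1,
                    "Status": "Enabled",
                    "Filter": {"Prefix": ""},  # empty prefix = all objects
                    "DeleteMarkerReplication": {"Status": "Disabled"},
                    "Destination": {
                        "Bucket": "arn:aws:s3:::example-backup-bucket",
                        "Account": "444455556666",  # optional: different account
                        "AccessControlTranslation": {"Owner": "Destination"},
                    },
                }
            ],
        },
    )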
Is there any way to move less frequently accessed S3 data to Glacier automatically? I mean to say, is there some option or service that searches S3 by last access date and then assigns a lifecycle policy, so the data can be moved to Glacier? Or do I have to write a program to do this? If this is not possible, is there any way to assign a lifecycle policy to all the buckets at once?
Looking for some feedback. Thank you.
No, this isn't possible as a ready-made feature. However, there is something that might help: Amazon S3 Analytics.
This produces a report of which items in your buckets are less frequently used. This information can be used to find items that should be archived.
It could be possible to use the S3 Analytics output as input for a script that tags items for archiving. However, the complete feature (find infrequently used items and then archive them) doesn't seem to be available as a standard product.
You can do this by adding a tag or prefix to your objects.
Create a lifecycle rule that targets that tag or prefix, and apply a single lifecycle policy to everything it matches.
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html
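A rough boto3 sketch of such a rule (the bucket name, tag, and the 30-day threshold are placeholders chosen for illustration):

    import boto3

    s3 = boto3.client("s3")

    # Placeholder bucket/tag; moves objects with this tag to Glacier after 30 days.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "archive-tagged-objects",
                    "Status": "Enabled",
                    "Filter": {"Tag": {"Key": "archive", "Value": "true"}},
                    "Transitions": [
                        {"Days": 30, "StorageClass": "GLACIER"},
                    ],
                }
            ]
        },
    )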
According to the documentation, versioned objects can only be deleted permanently by also supplying their version ID.
I had a look at Python's Boto and it seems simple enough for small sets of objects. But if I have a folder that contains 100,000 objects, it would have to delete them one by one, and that would take some time.
Is there a better way to go about this?
An easy way to delete versioned objects in an Amazon S3 bucket is to create a lifecycle rule. The rule runs on a batch basis (typically around midnight UTC), can expire objects within specified paths, and knows how to handle versioned objects.
See: Lifecycle Configuration for a Bucket with Versioning
Such deletions do not count towards the API call usage count, so it can be cheaper, too!
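A rough boto3 sketch of such a rule (the bucket, prefix, and day counts are placeholders): it expires current objects under a prefix and permanently removes noncurrent versions.

    import boto3

    s3 = boto3.client("s3")

    # Placeholder bucket/prefix and day counts.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-bucket",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-old-folder",
                    "Status": "Enabled",
                    "Filter": {"Prefix": "old-folder/"},
                    # Current versions get delete markers after 1 day...
                    "Expiration": {"Days": 1},
                    # ...and noncurrent versions are permanently deleted after 1 day.
                    "NoncurrentVersionExpiration": {"NoncurrentDays": 1},
                }
            ]
        },
    )

A separate rule with Expiration set to ExpiredObjectDeleteMarker can then clean up any delete markers left behind once all versions are gone.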
I am working on an app which uses S3 to store important documents. These documents need to be backed up on a daily, weekly rotation basis much like how database backups are maintained.
Does S3 support a feature where a bucket can be backed up into multiple buckets periodically, or perhaps into Amazon Glacier? I want to avoid using an external service as much as possible, and was hoping S3 had some mechanism to do this, as it's a common use case.
Any help would be appreciated.
Quote from Amazon S3 FAQ about durability:
Amazon S3 is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years
These numbers are, first of all, almost unbeatable. In other words, your data is safe in Amazon S3 as far as durability is concerned.
Thus, the only reason you would need to back up your data objects is to prevent their accidental loss (by your own mistake). To solve this problem, Amazon S3 offers versioning of S3 objects. Enable this feature on your S3 bucket and you're safe.
PS: Actually, there is one more possible reason - cost optimization. Amazon Glacier is cheaper than S3. I would recommend using AWS Data Pipeline to move S3 data to Glacier routinely.
Regarding Glacier, you can configure your bucket to move (old) S3 data to Glacier once it is older than a specified duration. This can save you cost if you want infrequently accessed data to be archived.
S3 buckets support lifecycle rules, which can automatically move data from S3 to Glacier.
But if you want to access these important documents frequently from the backup, you can also use another S3 bucket to back up your data. This backup can be scheduled using AWS Data Pipeline daily, weekly, etc.
* Glacier is cheaper than S3 Standard because it is designed for archival data with infrequent, slower retrievals, not because it compresses your data.
I created a Windows application that will allow you to schedule S3 bucket backups. You can create three kinds of backups: Cumulative, Synchronized and Snapshots. You can also include or exclude root level folders and files from your backups. You can try it free with no registration at https://www.bucketbacker.com