backing up s3 buckets best practice - amazon-web-services

I want to do a daily backup for s3 buckets. I was wondering if anyone knew what was best practice?
I was thinking of using a Lambda function to copy contents from one S3 bucket to another as the S3 bucket is updated. But that won't protect against an S3 failure. How do I copy contents from one S3 bucket to another Amazon service like Glacier using Lambda? What's the best practice here for backing up S3 buckets?
NOTE: I want to do a backup, not an archive (where content is deleted afterward).

Look into S3 cross-region replication to keep a backup copy of everything in another S3 bucket in another region. Note that you can even have the destination bucket be in a different AWS Account, so that it is safe even if your primary S3 account is hacked.
Note that a combination of Cross Region Replication and S3 Object Versioning (which is required for replication) will allow you to keep old versions of your files available even if they are deleted from the source bucket.
Then look into S3 lifecycle management to transition objects to Glacier to save storage costs.
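As a rough illustration of the replication setup described above, here is a minimal boto3 sketch; the bucket names, regions, and IAM role ARN are placeholders you would replace with your own:

import boto3

# Assumptions: 'my-source-bucket' is in us-east-1, 'my-backup-bucket' is in
# eu-west-1, and 'my-replication-role' is an IAM role that S3 can assume to
# replicate objects. All names and ARNs here are placeholders.
src = boto3.client('s3', region_name='us-east-1')
dst = boto3.client('s3', region_name='eu-west-1')

# Versioning must be enabled on BOTH buckets before replication will work.
src.put_bucket_versioning(Bucket='my-source-bucket',
                          VersioningConfiguration={'Status': 'Enabled'})
dst.put_bucket_versioning(Bucket='my-backup-bucket',
                          VersioningConfiguration={'Status': 'Enabled'})

# Replicate every new object from the source bucket to the backup bucket.
src.put_bucket_replication(
    Bucket='my-source-bucket',
    ReplicationConfiguration={
        'Role': 'arn:aws:iam::111111111111:role/my-replication-role',
        'Rules': [{
            'ID': 'backup-everything',
            'Status': 'Enabled',
            'Prefix': '',   # empty prefix = replicate all objects
            'Destination': {'Bucket': 'arn:aws:s3:::my-backup-bucket'},
        }],
    },
)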

Related

Do I need to setup Glacier Vault to archive data from S3?

I'm really new to AWS and quite confused about the purpose of a Glacier vault when I can archive my objects through S3 via a lifecycle rule. So do I have to first set up a Glacier vault to archive my objects?
Once upon a time, there was a service called Amazon Glacier. It was very low-cost, but it was very painful to use. Every request (even listing the contents of a vault) took a long time (e.g. make a request, come back an hour later to get the result).
Then, the clever people in Amazon S3 realized that they could provide a friendlier interface to Glacier. By simply changing the storage class of objects in S3 to Glacier, they would move the files to their own Glacier vault and save you all the hassle.
Then, the S3 team introduced Glacier Deep Archive, which is only available via Amazon S3 and is even lower cost than Glacier itself!
The children rejoiced and all cried out in unison... "We will now only use Glacier via S3. We will never go direct to Glacier again!"
No, you don't have to. You use Glacier vaults if you want to use extra features that the S3 Glacier service provides, such as Vault Lock policies and/or vault access policies.
For using just the Glacier storage class, you can use the Amazon S3 service and lifecycle rules.
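As a rough sketch of what such a lifecycle rule looks like with boto3 (the bucket name, the logs/ prefix, and the 180-day threshold are placeholder assumptions):

import boto3

s3 = boto3.client('s3')

# Hypothetical rule: after 180 days, move objects under logs/ to the
# GLACIER storage class (managed entirely through S3, no vault needed).
s3.put_bucket_lifecycle_configuration(
    Bucket='my-bucket',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'archive-old-logs',
            'Status': 'Enabled',
            'Filter': {'Prefix': 'logs/'},
            'Transitions': [{'Days': 180, 'StorageClass': 'GLACIER'}],
        }],
    },
)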

Copy S3 files to another S3 in a different account as they land

I want to set up a bucket for testing, where I pull in a previous month of data from an S3 bucket in a different AWS account, and then continue to consume data from that S3 bucket as it lands.
All the documentation I am seeing talks about full copies or syncs between buckets, and there is far too much data in the initial bucket for that. I need to be able to just pull fresh data in as it lands. I'm not sure what the best method for this would be, as it would likely be several thousand files a day.
You should look at S3 replication for your use case, to copy data from the primary bucket to the secondary bucket. S3 replication will copy only the new data from the primary bucket to the secondary bucket.
However, the secondary bucket will have to be in a different AWS Region. (Hopefully that's not a limiting factor for you.)
Check this link for setup: https://medium.com/@chrisjerry9618/s3-cross-region-replication-2e20f2dc86e0
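Replication covers the data as it lands; for the one-off backfill of the previous month, something like this boto3 sketch might do (the bucket names, the date-style key prefix, and the cross-account read permission are all assumptions):

import boto3

s3 = boto3.client('s3')

# One-off backfill: copy last month's objects from the other account's bucket
# into the test bucket. Assumes keys are prefixed with a date like '2020/01/'
# and that this account has been granted s3:GetObject on the source bucket.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='source-account-bucket', Prefix='2020/01/'):
    for obj in page.get('Contents', []):
        s3.copy_object(
            Bucket='my-test-bucket',
            Key=obj['Key'],
            CopySource={'Bucket': 'source-account-bucket', 'Key': obj['Key']},
        )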

Amazon AWS Athena S3 and Glacier Mixed Bucket

Amazon Athena Log Analysis Services with S3 Glacier
We have petabytes of data in S3. We are https://www.pubnub.com/ and we store usage data for our network in S3 for billing purposes. We have tab-delimited log files stored in an S3 bucket, and Athena is giving us a HIVE_CURSOR_ERROR failure.
Our S3 bucket is set up to automatically push objects to AWS Glacier after 6 months. The bucket has S3 files hot and ready to read in addition to the Glacier backup files, and we are getting access errors from Athena because of this. The file referenced in the error is a Glacier backup.
My guess is the answer will be: don't keep glacier backups in the same bucket. We don't have this option with ease due to our data volume sizes. I believe Athena will not work in this setup and we will not be able to use Athena for our log analysis.
However if there is a way we can use Athena, we would be thrilled. Is there a solution to HIVE_CURSOR_ERROR and a way to skip Glacier files? Our s3 bucket is a flat bucket without folders.
The S3 object name is redacted from the screenshots (not reproduced here). The file referenced in the HIVE_CURSOR_ERROR is in fact the Glacier object; a screenshot of our S3 bucket confirms it.
Note I tried to post on https://forums.aws.amazon.com/ but that was no bueno.
The AWS documentation, dated May 16, 2017, specifically states that Athena does not support the GLACIER storage class:
Athena does not support different storage classes within the bucket specified by the LOCATION clause, does not support the GLACIER storage class, and does not support Requester Pays buckets. For more information, see Storage Classes, Changing the Storage Class of an Object in Amazon S3, and Requester Pays Buckets in the Amazon Simple Storage Service Developer Guide.
We are also interested in this; if you get it to work, please let us know how. :-)
Since the release of February 18, 2019, Athena ignores objects with the GLACIER storage class instead of failing the query:
[…] As a result of fixing this issue, Athena ignores objects transitioned to the GLACIER storage class. Athena does not support querying data from the GLACIER storage class.
You must have an S3 bucket to work with. In addition, the AWS account that you use to initiate an S3 Glacier Select job must have write permissions for the S3 bucket. The Amazon S3 bucket must be in the same AWS Region as the vault that contains the archive object being queried.
S3 Glacier Select runs the query and stores the results in the S3 bucket.
Bottom line: you must move the data into an S3 bucket to use the S3 Glacier Select statement. Then use Athena on the 'new' S3 bucket.
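For completeness, here is a rough boto3 sketch of kicking off a Glacier Select job that writes its results to an S3 bucket. The vault name, archive ID, output bucket, and the tab-delimited column layout are all assumptions:

import boto3

glacier = boto3.client('glacier')

# Hypothetical Glacier Select job: query a tab-delimited archive and write the
# matching rows to an S3 bucket, where Athena (or anything else) can read them.
response = glacier.initiate_job(
    accountId='-',               # '-' means the current account
    vaultName='my-log-vault',
    jobParameters={
        'Type': 'select',
        'ArchiveId': 'EXAMPLE-ARCHIVE-ID',
        'Tier': 'Standard',
        'SelectParameters': {
            'InputSerialization': {'csv': {'FieldDelimiter': '\t'}},
            'ExpressionType': 'SQL',
            'Expression': "SELECT s._1, s._2 FROM archive s",
            'OutputSerialization': {'csv': {}},
        },
        'OutputLocation': {
            'S3': {
                'BucketName': 'my-select-results-bucket',
                'Prefix': 'glacier-select/',
                'StorageClass': 'STANDARD',
            },
        },
    },
)
print(response['jobId'])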

How to keep both data on aws s3 and glacier

I want to keep a backup of an AWS S3 bucket. If I use Glacier, it will archive the files from the bucket and move them to Glacier, but it will also delete the files from S3. I don't want to delete the files from S3. One option is to try an EBS volume: you can mount the AWS S3 bucket with s3fs and copy it to the EBS volume. Another way is to do an rsync of the existing bucket to a new bucket which will act as a clone. Is there any other way?
What you are looking for is cross-region replication:
https://aws.amazon.com/blogs/aws/new-cross-region-replication-for-amazon-s3/
Set up versioning and set up the replication.
On the target bucket you could set up a policy to archive to Glacier (or you could just use the bucket as a backup as-is).
(This will only work between two regions, i.e. the buckets cannot be in the same region.)
If you want your data to be present in both primary and backup locations then this is more of a data replication use case.
Consider using AWS Lambda, which is an event-driven compute service.
You can write a simple piece of code to copy the data wherever you want. It will execute every time there is a change in the S3 bucket; a minimal sketch follows below.
For more info, check the official documentation.
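A minimal sketch of such a Lambda function in Python, assuming the backup bucket name is passed in through an environment variable called BACKUP_BUCKET:

import os
import urllib.parse

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    """Copy every object reported in an S3 put event to a backup bucket.
    BACKUP_BUCKET is an assumed environment variable naming the target bucket."""
    backup_bucket = os.environ['BACKUP_BUCKET']
    for record in event['Records']:
        source_bucket = record['s3']['bucket']['name']
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        s3.copy_object(
            Bucket=backup_bucket,
            Key=key,
            CopySource={'Bucket': source_bucket, 'Key': key},
        )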

Copying multiple files in large volume between two s3 buckets which are in the different regions

I need to copy a large chunk of data, around 300 GB of files, from bucket A in the us-east region to bucket B in the ap-southeast region. I also need to change the structure of the bucket: I need to push the files into different folders in bucket B according to the image name in bucket A. I tried using AWS Lambda but it's not available in ap-southeast.
Also how much would it cost since data will be transferred between regions?
Method
The AWS Command-Line Interface (CLI) has the aws s3 cp command that can be used to move objects between buckets (even in different regions), and can rename them at the same time.
aws s3 cp s3://bucket-in-us/foo/bar.txt s3://bucket-in-ap/foo1/foo2/bar3.txt
There is also the aws s3 sync option that can be used to synchronize content between two buckets, but that doesn't help your requirement to rename objects.
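If renaming objects one at a time with the CLI is too manual, the same copy-and-rename can be scripted with boto3. The bucket names and the rule for deriving the destination folder from the image name below are assumptions you would adapt:

import boto3

s3 = boto3.client('s3')

SOURCE_BUCKET = 'bucket-in-us'   # placeholder names
DEST_BUCKET = 'bucket-in-ap'

def destination_key(key):
    """Hypothetical restructuring rule: file 'abc123.jpg' goes to 'abc/abc123.jpg'."""
    name = key.rsplit('/', 1)[-1]
    return f"{name[:3]}/{name}"

paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=SOURCE_BUCKET):
    for obj in page.get('Contents', []):
        # Managed copy: handles multipart for objects larger than 5 GB.
        s3.copy(
            CopySource={'Bucket': SOURCE_BUCKET, 'Key': obj['Key']},
            Bucket=DEST_BUCKET,
            Key=destination_key(obj['Key']),
        )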
Cost
Data Transfer charges from US regions to another region are shown on the Amazon S3 pricing page as US$0.02/GB.
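At that rate, the 300 GB in question would cost roughly 300 GB × US$0.02/GB ≈ US$6 in data transfer, plus the usual per-request charges.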
Alternatively, create another bucket in your target region, use bucket replication to copy the data across, and then do your S3 object key manipulation there.
Read more on S3 cross-region replication.