Snapshot based S3 bucket backup solution [closed] - amazon-web-services

I am trying to build a backup system for some important data in my AWS S3 bucket.
Among the options I explored was versioning, which allows individual objects to be recovered to an earlier state.
This would definitely help in the case of accidental deletions.
But the problem is that in situations where data is corrupted, for example by some bad code that was introduced, restoring the system to an earlier state requires a proper snapshot-based backup solution in addition to versioning.
A snapshot-based backup would also help in a situation where, say, the whole bucket was deleted accidentally, or versioning was turned off and some data was deleted later.
The option I am currently considering is to use an EC2 instance to copy the data daily, or at predefined intervals, to a local drive (using aws s3 sync or aws s3 cp) and then upload it under that day's folder in another S3 bucket, with a lifecycle rule to expire the backups after, say, a week.
I don't think this is very efficient, though, because the bucket could hold about 100 GB of data later as traffic to the application increases.
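For reference, a minimal sketch of this daily-copy approach (bucket names are placeholders); note that aws s3 sync can also copy bucket-to-bucket directly, without staging on a local drive:
# Run daily, eg from cron on the EC2 instance; each day's copy lands under a dated prefix.
TODAY=$(date +%Y-%m-%d)
aws s3 sync s3://my-production-bucket "s3://my-backup-bucket/$TODAY/"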
I would like some validation from someone who has done something similar: is this the right way to proceed, or is there some S3 or AWS feature that can be used to make this simpler?

Traditionally, backups are used in case a storage device is corrupted. However, Amazon S3 replicates data automatically to multiple storage devices, so this takes care of durability.
For data corruption (eg an application destroys the contents of a file), Versioning is the best option because S3 will retain previous versions of an object whenever the object is updated (overwritten). Object Lifecycle Management can be used to delete versions after a certain number of versions or after a given period of time.
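For example, both can be set up with the AWS CLI roughly like this (the bucket name and the 90-day retention are placeholders):
# Enable versioning on the bucket.
aws s3api put-bucket-versioning --bucket my-important-bucket \
  --versioning-configuration Status=Enabled
# Lifecycle rule to permanently remove noncurrent versions after 90 days.
aws s3api put-bucket-lifecycle-configuration --bucket my-important-bucket \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "expire-old-versions",
      "Status": "Enabled",
      "Filter": {"Prefix": ""},
      "NoncurrentVersionExpiration": {"NoncurrentDays": 90}
    }]
  }'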
If you are concerned that versioning might be turned off (suspended), or that the whole bucket could be accidentally deleted, you can use S3 replication to duplicate the contents of the bucket to another bucket. The other bucket can even be in a different region or a different AWS account, which means that nobody in the primary account would have permission to delete data in the secondary (replica) account. This is a common practice to ensure critical business data is not lost.
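A rough sketch of such a replication rule via the CLI (bucket names, account ID and IAM role are placeholders; both buckets must have versioning enabled, the role needs replication permissions, and a cross-account destination additionally needs a bucket policy granting the role access):
aws s3api put-bucket-replication --bucket my-source-bucket \
  --replication-configuration '{
    "Role": "arn:aws:iam::111122223333:role/s3-replication-role",
    "Rules": [{
      "ID": "replicate-everything",
      "Status": "Enabled",
      "Prefix": "",
      "Destination": {"Bucket": "arn:aws:s3:::my-backup-bucket"}
    }]
  }'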
If you want the ability to restore multiple objects to a point-in-time ("retrieve the system to an earlier state"), you can use traditional backup software that is S3-aware. For example, MSP Backup (formerly CloudBerry Lab) has backup software that can move data between S3 buckets and on-premises storage (or just within S3), with normal point-in-time restore capabilities.

Related

AWS S3 batch operation - Got Dinged Pretty Hard

We used the newly introduced S3 Batch Operations feature to back up our S3 bucket, which had about 15 TB of data, to S3 Glacier. Prior to backing up we had estimated the bandwidth and storage costs, and also taken into account the mandatory 90-day storage requirement for Glacier.
However, the actual costs turned out to be massive compared to our estimate. We somehow overlooked the upload request costs, which run at $0.05 per 1000 requests. We have many millions of files, each file upload counted as a request, and we are looking at several thousand dollars worth of spend :(
I am wondering if there was any way to avoid this?
The concept of "backup" is quite interesting.
Traditionally, where data was stored on one disk, a backup was imperative because it's not good to have a single point-of-failure.
Amazon S3, however, stores data on multiple devices across multiple Availability Zones (effectively multiple data centers), which is how they get their 99.999999999% durability and 99.99% availability. (Note that durability means the likelihood of retaining the data, which isn't quite the same as availability which means the ability to access the data. I guess the difference is that during a power outage, the data might not be accessible, but it hasn't been lost.)
Therefore, the traditional concept of taking a backup in case of device failure has already been handled in S3, all for the standard cost. (There is an older Reduced Redundancy option that only copied to 2 AZs instead of 3, but that is no longer recommended.)
Next comes the concept of backup in case of accidental deletion of objects. When an object is deleted in S3, it is not recoverable. However, enabling versioning on a bucket will retain multiple versions including deleted objects. This is great where previous histories of objects need to be kept, or where deletions might need to be undone. The downside is that storage costs include all versions that are retained.
There is also the newer Object Lock capability in S3, where objects can be locked for a period of time (eg 3 years) without the ability to delete them. This is ideal for situations where information must be retained for a period and it avoids accidental deletion. (There is also a Legal Hold capability that works the same way, but can be turned on/off if you have the appropriate permissions.)
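For example, a default retention period can be applied with the CLI, assuming the bucket was created with Object Lock enabled (bucket name and retention period are placeholders):
aws s3api put-object-lock-configuration --bucket my-locked-bucket \
  --object-lock-configuration '{
    "ObjectLockEnabled": "Enabled",
    "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 3}}
  }'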
Finally, there is the potential for deliberate malicious deletion if an angry staff member decides to take revenge on your company for not stocking their favourite flavour of coffee. If an AWS user has the necessary permissions, they can delete the data from S3. To guard against this, you should limit who has such permissions and possibly combine it with versioning (so they can delete the current version of an object, but it is actually retained by the system).
This can also be addressed by using Cross-Region Replication of Amazon S3 buckets. Some organizations use this to copy data to a bucket owned by a different AWS account, such that nobody has the ability to delete data from both accounts. This is closer to the concept of a true backup because the copy is kept separate (account-wise) from the original. The extra cost of storage is minimal compared to the potential costs if the data was lost. Plus, if you configure the replica bucket to use the Glacier Deep Archive storage class, the costs can be quite low.
Your copy to Glacier is another form of backup (and offers cheaper storage than S3 in the long term), but it would need to be updated on a regular basis to be a continuous backup (eg by using backup software that understands S3 and Glacier). The "5c per 1000 requests" cost means that it is better used for archives (eg large zip files) rather than many small files.
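As a rough illustration of the difference bundling makes (the object counts here are hypothetical, using the $0.05 per 1,000 requests figure above):
20,000,000 individual objects / 1,000 = 20,000 request blocks; 20,000 x $0.05 = $1,000 in upload request charges
20,000 zip archives (1,000 files each) / 1,000 = 20 request blocks; 20 x $0.05 = $1 in upload request charges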
Bottom line: Your need for a backup might be as simple as turning on Versioning and limiting which users can totally delete an object (including all past versions) from the bucket. Or, create a bucket replica and store it in Glacier Deep Archive storage class.
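A minimal sketch of the "limit who can totally delete an object" part, expressed as a bucket policy applied via the CLI (bucket name, account ID and role name are placeholders, not a definitive policy):
# Deny permanent version deletion and versioning changes to everyone except one admin role.
aws s3api put-bucket-policy --bucket my-important-bucket --policy '{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "DenyVersionDeletion",
    "Effect": "Deny",
    "Principal": "*",
    "Action": ["s3:DeleteObjectVersion", "s3:PutBucketVersioning"],
    "Resource": ["arn:aws:s3:::my-important-bucket", "arn:aws:s3:::my-important-bucket/*"],
    "Condition": {
      "StringNotLike": {"aws:PrincipalArn": "arn:aws:iam::111122223333:role/backup-admin"}
    }
  }]
}'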

If AWS is already backing up Dynamo, what's the point of doing my own backups?

We have a completely serverless application, with only lambdas and DynamoDB.
The lambdas are running in two regions, and the originals are stored in Cloud9.
DynamoDB is configured with all tables global (bidirectional multi-master replication across the two regions), and the schema definitions are stored in Cloud9.
The only data loss we need to worry about is DynamoDB, which, even if it crashed in both regions, is presumably diligently backed up by AWS.
Given all of that, what is the point of classic backups? If both regions were completely obliterated, we'd likely be out of business anyway, and anything short of that would be recoverable from AWS.
Not all AWS regions support backup and restore functionality. You'll need to roll your own solution for backups in unsupported regions.
If all the regions your application runs in support the backup functionality, you probably don't need to do it yourself. That is the point of going serverless: you let the platform handle simple DevOps tasks.
Having redundancy with regional, or optionally cross-regional, replication for DynamoDB mainly provides durability, availability and fault tolerance for your data storage. However, alongside these built-in capabilities, there can still be a need for backups.
For instance, if there is data corruption due to an external threat (like an attack) or an application malfunction, you might still want to restore the data. This is one place where having backups is useful, to restore the data to a recent point in time.
There can also be compliance-related requirements that call for taking backups of your database system.
Another use case is when there is a need to create new DynamoDB tables for your build pipeline and quality assurance: it is more practical to re-use an already made snapshot of data from a backup rather than taking a copy from the live database (since that can consume the provisioned throughput and affect application behaviour).
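For example, DynamoDB's on-demand backup feature can cover both the restore and the QA-table use cases (table names and the backup ARN are placeholders):
# Take an on-demand backup of the live table.
aws dynamodb create-backup --table-name orders --backup-name orders-2020-01-01
# Restore it into a separate table, eg for a QA environment.
aws dynamodb restore-table-from-backup --target-table-name orders-qa \
  --backup-arn arn:aws:dynamodb:us-east-1:111122223333:table/orders/backup/01234567890123-abcdefgh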

What are possible ways to access Amazon S3 data if an S3 outage happens?

Can someone help me understand the S3 outage use case here?
The probability of an S3 outage is very low, but if it happens, what are the ways we can access the data that sits in S3?
I know one possibility is cross-region replication, but that works only for new files that I put in my S3 bucket after I enable it. What happens to old files? I know that if I go and upload all those historical files to the other region as well, then it works.
Then again, the same question: if both regions go down, then what?
I am sure others have thought about this. Any inputs on this?
From Protecting Data in Amazon S3:
Objects are redundantly stored on multiple devices across multiple facilities in an Amazon S3 region. To help better ensure data durability, Amazon S3 PUT and PUT Object copy operations synchronously store your data across multiple facilities before returning SUCCESS. Once the objects are stored, Amazon S3 maintains their durability by quickly detecting and repairing any lost redundancy.
...
Backed with the Amazon S3 Service Level Agreement
Designed to provide 99.999999999% durability and 99.99% availability of objects over a given year
Designed to sustain the concurrent loss of data in two facilities
So, if you're still not happy with all those statements, how can you access your data in an outage?
If your data is in only one region, and the region is not accessible, then your data is not accessible. Note, however, that an external network connectivity problem could prevent access to Amazon S3, yet Amazon S3 might still be accessible from Amazon EC2 instances in the same region.
Cross-region replication will copy your data to another Amazon S3 region. It requires versioning to be activated. To copy any files that existed prior to activating cross-region replication, use the sync command in the AWS Command-Line Interface (CLI), eg:
aws s3 sync s3://bucket1/folder s3://bucket2/folder
Each AWS region operates independently, so the possibility of multiple regions suffering outages would presumably be even less likely.
If you are feeling particularly paranoid, you could copy your data to another cloud provider (Azure, Google, Rackspace, etc). There are tools that can assist:
CloudBerry Cloud Migrator
AzureCopy
...and no doubt many more!

Deleting a large number of Versions from Amazon S3 Bucket

OK, so I have a slight problem. I have had a backup program running on a NAS to an Amazon S3 bucket, and have had versioning enabled on the bucket. The NAS stores around 900 GB of data.
I've had this running for a number of months now, and have been watching the bill go up and up for the cost of Amazon's Glacier service (which my versioning lifecycle rules store objects in). The cost eventually got so high that I have had to suspend versioning on the bucket in an effort to stop any more costs.
I now have a large number of versions on all our objects (the original post included a screenshot of one file's version list).
I have two questions:
I'm currently looking for a way to delete this large number of versioned files. From Amazon's own documentation it would appear I have to delete each version individually: is this correct? If so, what is the best way to achieve this? I assume it would be some kind of script that lists each item in the bucket and issues a delete-version request for each versioned object. That would be a lot of requests, and I guess that leads on to my next question.
What are the cost implications of deleting a large number of Glacier objects in this way? It seems that deleting objects in Glacier is expensive; does this also apply to versions created in S3?
Happy to provide more details if needed,
Thanks
Deletions from S3 are free, even if S3 has migrated the object to Glacier, unless the object has been in Glacier for less than 3 months, because Glacier is intended for long-term storage. Only in that case are you billed for the amount of time left (e.g., for an object stored for only 2 months, you will be billed an early deletion charge equal to 1 more month).
You will still have to identify and specify the versions to delete, but S3 accepts up to 1,000 keys or versions in a single multi-object delete request.
http://docs.aws.amazon.com/AmazonS3/latest/API/multiobjectdeleteapi.html
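For what it's worth, a rough sketch of such a script using the AWS CLI (bucket and prefix are placeholders; this handles one batch of up to 1,000 versions, so it would need to be repeated until nothing is left, and delete markers would need the same treatment via the DeleteMarkers list):
BUCKET=my-nas-backup-bucket
PREFIX=backups/
# List up to 1,000 versions and write them in the shape delete-objects expects.
aws s3api list-object-versions --bucket "$BUCKET" --prefix "$PREFIX" \
  --max-items 1000 \
  --query '{Objects: Versions[].{Key: Key, VersionId: VersionId}, Quiet: `true`}' \
  --output json > batch.json
# Delete that batch of versions in a single multi-object delete request.
aws s3api delete-objects --bucket "$BUCKET" --delete file://batch.json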

What is maximum Amazon S3 replication time on file upload?

Background
We use Amazon S3 in our project as a storage for files uploaded by clients.
For technical reasons, we upload a file to S3 with a temporary name, then process its contents and rename the file after it has been processed.
Problem
The 'rename' operation fails from time to time with a 404 (key not found) error, although the file being renamed was uploaded successfully.
Amazon docs mention this problem:
Amazon S3 achieves high availability by replicating data across multiple servers within Amazon's data centers.
If a PUT request is successful, your data is safely stored. However, information about the changes must replicate across Amazon S3, which can take some time, and so you might observe the following behaviors:
We implemented a kind of polling as workaround: retry the 'rename' operation until it succeeds.
The polling stops after 20 seconds.
This workaround works in most cases: the file gets replicated within a few seconds.
But sometimes — very rarely — 20 seconds are not enough; the replication in S3 takes more time.
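For illustration, a minimal sketch of that polling workaround using the AWS CLI (bucket and key names are placeholders; a 'rename' in S3 is really a copy plus a delete, which is what aws s3 mv does):
BUCKET=my-bucket
TMP_KEY=uploads/tmp-object
FINAL_KEY=uploads/final-object
# Poll for up to ~20 seconds until the newly uploaded object is visible, then rename it.
for attempt in $(seq 1 10); do
  if aws s3api head-object --bucket "$BUCKET" --key "$TMP_KEY" > /dev/null 2>&1; then
    aws s3 mv "s3://$BUCKET/$TMP_KEY" "s3://$BUCKET/$FINAL_KEY"
    break
  fi
  sleep 2
done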
Questions
What is the maximum time you observed between a successful PUT operation and complete replication on Amazon S3?
Does Amazon S3 offer a way to 'bypass' replication? (Query 'master' directly?)
Update: this answer uses some older terminology, which I have left in place, for the most part. AWS has changed the friendly name of "US-Standard" to be more consistent with the naming of other regions, but its regional endpoint for IPv4 still has the unusual name s3-external-1.amazonaws.com.
The us-east-1 region of S3 has an IPv4/IPv6 "dual stack" endpoint that follows the standard convention of s3.dualstack.us-east-1.amazonaws.com and if you are IPv6 enabled, this endpoint seems operationally-equivalent to s3-external-1 as discussed below.
The documented references to geographic routing of requests for this region seem to have largely disappeared, without much comment, but anecdotal evidence suggests that the following information is still relevant to that region.
Q. Wasn’t there a US Standard region?
We renamed the US Standard Region to US East (Northern Virginia) Region to be consistent with AWS regional naming conventions.
— https://aws.amazon.com/s3/faqs/#regions
Buckets using the S3 Transfer Acceleration feature use a global-style endpoint of ${bucketname}.s3-accelerate.amazonaws.com and it is not yet evident how this endpoint behaves with regard to us-east-1 buckets and eventual consistency, though it stands to reason that other regions should not be affected by this feature, if enabled. This feature improves transfer throughput for users who are more distant from the bucket by routing requests to the same S3 endpoints but proxying through the AWS "Edge Network," the same system that powers CloudFront. It is, essentially, a self-configuring path through CloudFront but without caching enabled. The acceleration comes from optimized network stacks and keeping the traffic on the managed AWS network for much of its path across the Internet. As such, this feature should have no impact on consistency, if you enable and use it on a bucket... but, as I mentioned, how it interacts with us-east-1 buckets is not yet known.
The US-Standard (us-east-1) region is the oldest, and presumably largest, region of S3, and does play by some different rules than the other, newer regions.
An important and relevant difference is the consistency model.
Amazon S3 buckets in [all regions except US Standard] provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES. Amazon S3 buckets in the US Standard region provide eventual consistency.
http://aws.amazon.com/s3/faqs/
This is why I assumed you were using US Standard. The behavior you described is consistent with that design constraint.
You should be able to verify that this doesn't happen with a test bucket in another region... but, because data transfer from EC2 to S3 within the same region is free and very low latency, using a bucket in a different region may not be practical.
There is another option that is worth trying; it has to do with the inner workings of US-Standard.
US Standard is in fact geographically-distributed between Virginia and Oregon, and requests to "s3.amazonaws.com" are selectively routed via DNS to one location or another. This routing is largely a black box, but Amazon has exposed a workaround.
You can force your requests to be routed only to Northern Virginia by changing your endpoint from "s3.amazonaws.com" to "s3-external-1.amazonaws.com" ...
http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
... this is speculation on my part, but your issue may be exacerbated by geographic routing of your requests, and forcing them to "s3-external-1" (which, to be clear, is still US-Standard), might improve or eliminate your issue.
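For example, with the AWS CLI the endpoint can be overridden per command (bucket and key names are placeholders):
aws s3 mv s3://mybucket/tmp-object s3://mybucket/final-object \
  --endpoint-url https://s3-external-1.amazonaws.com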
Update: The advice above has officially risen above speculation, but I'll leave it for historical reference. About a year after I wrote the above, Amazon indeed announced that US-Standard does offer read-after-write consistency on new object creation, but only when the s3-external-1 endpoint is used. They explain it as though it's a new behavior, and that may be the case... but it may also simply be a change in the behavior the platform officially supports. Either way:
Starting [2015-06-19], the US Standard Region now supports read-after-write consistency for new objects added to Amazon S3 using the Northern Virginia endpoint (s3-external-1.amazonaws.com). With this change, all Amazon S3 Regions now support read-after-write consistency. Read-after-write consistency allows you to retrieve objects immediately after creation in Amazon S3. Prior to this change, Amazon S3 buckets in the US Standard Region provided eventual consistency for newly created objects, which meant that some small set of objects might not have been available to read immediately after new object upload. These occasional delays could complicate data processing workflows where applications need to read objects immediately after creating the objects. Please note that in US Standard Region, this consistency change applies to the Northern Virginia endpoint (s3-external-1.amazonaws.com). Customers using the global endpoint (s3.amazonaws.com) should switch to using the Northern Virginia endpoint (s3-external-1.amazonaws.com) in order to leverage the benefits of this read-after-write consistency in the US Standard Region. [emphasis added]
https://forums.aws.amazon.com/ann.jspa?annID=3112
If you are uploading a large number of files (hundreds per second), you might also be overwhelming S3's sharding mechanism. For very high numbers of uploads per second, it's important that your keys ("filenames") not be lexically sequential.
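One common way to avoid lexically sequential keys is to prepend a short hash of the key, roughly like this (names are illustrative):
KEY="uploads/2016-01-01/file-000123.dat"
HASHED_PREFIX=$(echo -n "$KEY" | md5sum | cut -c1-4)
aws s3 cp local-file.dat "s3://my-bucket/$HASHED_PREFIX/$KEY"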
Depending on how Amazon handles DNS, you may also want to try another alternate variant of addressing your bucket if your code can handle it.
Buckets in US-Standard can be addressed either with http://mybucket.s3.amazonaws.com/key ... or http://s3.amazonaws.com/mybucket/key ... and the internal implementation of these two could, at least in theory, be different in a way that changes the behavior in a way that would be relevant to your issue.
As you noted, currently there is no guarantee or workaround for eventual consistency directly from S3. In this talk from Netflix, the speaker mentions having seen a 7-hour (extremely rare, IMHO) consistency delay. They even created a consistency layer on top of S3, s3mper, which is open source and might help in your context.
Other than that, as Michael - sqlbot suggested, us-standard does not offer read-after-write consistency, and the observed consistency delays may be different there.