I've seen many environments where critical data is backed up to Amazon S3 and it is assumed that this will basically never fail.
I know that Amazon reports that data stored in S3 has 99.999999999% durability (11 9's), but one thing that I'm struck by is the following passage from the AWS docs:
Amazon S3 provides a highly durable storage infrastructure designed
for mission-critical and primary data storage. Objects are redundantly
stored on multiple devices across multiple facilities in an Amazon S3
region.
So, S3 objects are only replicated within a single AWS region. Say there's an earthquake in N. California that decimates the whole region. Does that mean N. California S3 data has gone with it?
I'm curious what others consider best practice for persisting mission-critical data in S3.
Related
This is a bit generic, but I was wondering: is there a specific bucket setup that provides better durability for the hosted data in the case of a technical failure at a specific data center location, or does Amazon ensure the durability/integrity of customers' data in case of a technical failure regardless of which data center location and storage class are chosen? Is there a need for multi-region replication for the sake of better protection against data loss, or is that something used purely for compliance/latency purposes?
S3 provides 99.999999999% durability, which is more than enough for most scenarios. But if you want even better protection for your data, you can use S3 replication to copy it to a different bucket or even a different region.
Replication is not only used for compliance/latency. It is also used as a backup for the most critical data that can't be lost in the unlikely event of an entire AWS region failing, e.g. due to a massive earthquake.
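For reference, here is a minimal sketch of how such cross-region replication could be configured with the AWS CLI. The bucket names, account ID and IAM role ARN are placeholders, and versioning has to be enabled on both the source and destination buckets before the rule can be created:

# All names below are placeholders, not real resources.
# Versioning must already be enabled on both buckets for replication to work.
aws s3api put-bucket-replication --bucket my-source-bucket --replication-configuration '{
  "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
  "Rules": [{
    "Status": "Enabled",
    "Prefix": "",
    "Destination": { "Bucket": "arn:aws:s3:::my-backup-bucket" }
  }]
}'

Note that replication only covers objects written after the rule is in place; existing objects have to be copied across separately.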
I understand that the documentation states that multi-regional and regional buckets are not inter-convertible, but I fail to see the technical hindrance to it.
Objects stored in multi-regional buckets are geo-redundant. That is to say, data kept in multi-regional buckets is stored in at least two separate places that are separated by at least 100 miles. Geo-redundancy ensures maximum availability of your data, even in the event of large-scale disruptions, such as natural disasters.
Regional buckets keep multiple copies of your data in one specific regional location. This has performance advantages for data-intensive computations.
Converting a regional bucket to a multi-regional bucket would require moving at least one full copy of the data of the bucket from one location to another. For very large buckets, this transfer would take a significant period of time. It's not currently available as a built-in, instantaneous feature.
Google Cloud does, however, offer a Data Transfer Service which can manage moving objects from a regional bucket to a multi-regional bucket.
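As a rough sketch of doing that yourself (the bucket names are placeholders), you can create a new multi-regional bucket and copy the data across with gsutil:

# Placeholder bucket names; "US" is a multi-regional location.
gsutil mb -l US gs://my-multiregional-bucket
gsutil -m rsync -r gs://my-regional-bucket gs://my-multiregional-bucket

For very large buckets the transfer service mentioned above is usually the better option, since the copy is managed server-side rather than through the machine running gsutil.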
Can someone help me understand the S3 outage use case here?
The probability of an S3 outage is very low, but if it does happen, what are the ways we can access data that sits in S3?
I know there is one possibility, cross-region replication, but that only works for new files that I put into my S3 bucket after I enable it. What happens to old files? I know that if I go and upload all those historical files to the other region as well, then it works.
Then again the same question: if both regions go down, then what?
I am sure others have thought of this. Any input on this would be appreciated.
From Protecting Data in Amazon S3:
Objects are redundantly stored on multiple devices across multiple facilities in an Amazon S3 region. To help better ensure data durability, Amazon S3 PUT and PUT Object copy operations synchronously store your data across multiple facilities before returning SUCCESS. Once the objects are stored, Amazon S3 maintains their durability by quickly detecting and repairing any lost redundancy.
...
Backed with the Amazon S3 Service Level Agreement
Designed to provide 99.999999999% durability and 99.99% availability of objects over a given year
Designed to sustain the concurrent loss of data in two facilities
So, if you're still not happy with all those statements, how can you access your data in an outage?
If your data is in only one region, and the region is not accessible, then your data is not accessible. Note, however, that an external network connectivity problem could prevent access to Amazon S3, yet Amazon S3 might still be accessible from Amazon EC2 instances in the same region.
Cross-region replication will copy your data to another Amazon S3 region. It requires versioning to be activated. To copy any files that existed prior to activating cross-region replication, use the sync command in the AWS Command-Line Interface (CLI), e.g.:
aws s3 sync s3://bucket1/folder s3://bucket2/folder
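If you want to preview what the sync would copy before actually running it, the CLI also has a --dryrun flag that only lists the operations it would perform:

aws s3 sync --dryrun s3://bucket1/folder s3://bucket2/folder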
Each AWS region operates independently, so the possibility of multiple regions suffering outages would presumably be even less likely.
If you are feeling particularly paranoid, you could copy your data to another cloud provider (Azure, Google, Rackspace, etc). There are tools that can assist:
CloudBerry Cloud Migrator
AzureCopy
...and no doubt many more!
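As one more illustrative option (not an endorsement), rclone can sync between S3 and other providers once remotes have been set up with rclone config; the remote names below are ones you would define yourself:

# "mys3" and "myazure" are remote names configured beforehand via `rclone config`.
rclone sync mys3:bucket1/folder myazure:container1/folder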
I am working on an app which uses S3 to store important documents. These documents need to be backed up on a daily, weekly rotation basis much like how database backups are maintained.
Does S3 support a feature where a bucket can be backed up into multiple buckets periodically, or perhaps into Amazon Glacier? I want to avoid using an external service as much as possible, and was hoping S3 had some mechanism to do this, as it's a common use case.
Any help would be appreciated.
Quote from Amazon S3 FAQ about durability:
Amazon S3 is designed to provide 99.999999999% durability of objects over a given year. This durability level corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000,000 years
First of all, these numbers mean the durability is almost unbeatable. In other words, your data is safe in Amazon S3.
Thus, the only reason why you would need to back up your data objects is to prevent their accidental loss (by your own mistake). To solve this problem, Amazon S3 offers versioning of S3 objects. Enable this feature on your S3 bucket and you're safe.
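As a minimal sketch of what that looks like with the AWS CLI (the bucket name and object key are placeholders):

# Enable versioning on the bucket (placeholder name).
aws s3api put-bucket-versioning --bucket my-docs-bucket --versioning-configuration Status=Enabled

# After an accidental delete or overwrite, the older versions are still listed and retrievable:
aws s3api list-object-versions --bucket my-docs-bucket --prefix important.doc

An accidental DELETE only adds a delete marker on top of the object; removing that marker with delete-object --version-id brings the object back.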
P.S. Actually, there is one more possible reason: cost optimization. Amazon Glacier is cheaper than S3. I would recommend using AWS Data Pipeline to move S3 data to Glacier routinely.
Regarding Glacier, you can configure your bucket to move (old) S3 data to Glacier once it is older than a specified duration. This can save you cost if you want infrequently accessed data to be archived.
In an S3 bucket there are lifecycle rules with which we can automatically move data from S3 to Glacier (a sketch of such a rule is shown below).
But if you want to access these important documents frequently from the backup, then you can also use another S3 bucket to back up your data. This backup can be scheduled using AWS Data Pipeline daily, weekly, etc.
*Glacier is cheaper than S3 because it is designed for archival storage that is accessed infrequently.
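A minimal sketch of such a lifecycle rule with the AWS CLI; the bucket name and the 90-day threshold are example values only:

aws s3api put-bucket-lifecycle-configuration --bucket my-docs-bucket --lifecycle-configuration '{
  "Rules": [{
    "ID": "archive-old-docs",
    "Status": "Enabled",
    "Filter": { "Prefix": "" },
    "Transitions": [{ "Days": 90, "StorageClass": "GLACIER" }]
  }]
}'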
I created a Windows application that will allow you to schedule S3 bucket backups. You can create three kinds of backups: Cumulative, Synchronized and Snapshots. You can also include or exclude root level folders and files from your backups. You can try it free with no registration at https://www.bucketbacker.com
Given that S3 has 99.999999999% durability [1], what is the equivalent figure for DynamoDB?
[1] http://aws.amazon.com/s3/
This question implies something that is incorrect. Though S3 has an SLA (aws.amazon.com/s3-sla), that SLA references availability (99.9%) but makes no reference to durability, or the loss of objects in S3.
The 99.999999999% durability figure comes from Amazon's estimate of what S3 is designed to achieve and there is no related SLA.
Note that Amazon S3 is designed for 99.99% availability but the SLA kicks in at 99.9%.
There is no current DynamoDB SLA from Amazon, nor am I aware of any published figures from Amazon on the expected or designed durability of data in DynamoDB. I would suspect that it is less than S3 given the nature, relative complexities, and goals of the two systems (i.e., S3 is designed to simply store data objects very, very safely; DynamoDB is designed to provide super-fast reads and writes in a scalable distributed database while also trying to keep your data safe).
Amazon talks about customers backing up DynamoDB to S3 using MapReduce. They also say that some customers back up DynamoDB using Redshift, which has DynamoDB compatibility built in. I additionally recommend backing up to an off-AWS store to remove the single point of failure that is your AWS account.
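For small tables, a naive sketch of such a backup with nothing but the CLI might look like the following (table and bucket names are placeholders; for large tables the MapReduce/EMR route mentioned above is more appropriate, since a full scan consumes read capacity):

# Placeholder table and bucket names; a full scan reads the entire table.
aws dynamodb scan --table-name MyTable --output json > MyTable-backup.json
aws s3 cp MyTable-backup.json s3://my-backup-bucket/dynamodb/MyTable-backup.json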
Although the DynamoDB FAQ doesn't use the exact same wording, as you can see from my highlights below, both DynamoDB and S3 are designed to be fault-tolerant, with data stored in three facilities.
I wasn't able to find exact figures reported anywhere, but from the information I did have, it looks like DynamoDB is pretty durable (on par with S3), although that won't stop it from having service interruptions from time to time. See this link:
http://www.forbes.com/sites/kellyclay/2013/02/20/amazons-aws-experiencing-problems-again/
S3 FAQ: http://aws.amazon.com/s3/faqs/#How_is_Amazon_S3_designed_to_achieve_99.999999999%_durability
Q: How durable is Amazon S3? Amazon S3 is designed to provide
99.999999999% durability of objects over a given year.
In addition, Amazon S3 is designed to sustain the concurrent loss of
data in two facilities.
Also Note: The "99.999999999%" figure for S3 is over a given year.
DynamoDB FAQ: http://aws.amazon.com/dynamodb/faqs/#Is_there_a_limit_to_how_much_data_I_can_store_in_Amazon_DynamoDB
Scale, Availability, and Durability
Q: How highly available is Amazon DynamoDB?
The service runs across Amazon’s proven, high-availability data
centers. The service replicates data across three facilities in an AWS
Region to provide fault tolerance in the event of a server failure or
Availability Zone outage.
Q: How does Amazon DynamoDB achieve high uptime and durability?
To achieve high uptime and durability, Amazon DynamoDB synchronously
replicates data across three facilities within an AWS Region.