Background
We use Amazon S3 in our project as a storage for files uploaded by clients.
For technical reasons, we upload a file to S3 with a temporary name, then process its contents and rename the file after it has been processed.
Problem
The 'rename' operation fails time after time with 404 (key not found) error, although the file being renamed had been uploaded successfully.
Amazon docs mention this problem:
Amazon S3 achieves high availability by replicating data across multiple servers within Amazon's data centers.
If a PUT request is successful, your data is safely stored. However, information about the changes must replicate across Amazon S3, which can take some time, and so you might observe the following behaviors:
We implemented a kind of polling as workaround: retry the 'rename' operation until it succeeds.
The polling stops after 20 seconds.
This workaround works in most cases: the file gets replicated within few seconds.
But sometimes — very rarely — 20 seconds are not enough; the replication in S3 takes more time.
Questions
What is the maximum time you observed between a successful PUT operation and complete replication on Amazon S3?
Does Amazon S3 offer a way to 'bypass' replication? (Query 'master' directly?)
Update: this answer uses some older terminology, which i have left in place, for the most part. AWS has changed the friendly name of "US-Standard" to be more consistent with the naming of other regions, but its regional endpoint for IPv4 still has the unusual name s3-external-1.amazonaws.com.
The us-east-1 region of S3 has an IPv4/IPv6 "dual stack" endpoint that follows the standard convention of s3.dualstack.us-east-1.amazonaws.com and if you are IPv6 enabled, this endpoint seems operationally-equivalent to s3-external-1 as discussed below.
The documented references to geographic routing of requests for this region seem to have largely disappeared, without much comment, but anecdotal evidence suggests that the following information is still relevant to that region.
Q. Wasn’t there a US Standard region?
We renamed the US Standard Region to US East (Northern Virginia) Region to be consistent with AWS regional naming conventions.
— https://aws.amazon.com/s3/faqs/#regions
Buckets using the S3 Transfer Acceleration feature use a global-style endpoint of ${bucketname}.s3-accelerate.amazonaws.com and it is not yet evident how this endpoint behaves with regard to us-east-1 buckets and eventual consistency, though it stands to reason that other regions should not be affected by this feature, if enabled. This feature improves transfer throughput for users who are more distant from the bucket by routing requests to the same S3 endpoints but proxying through the AWS "Edge Network," the same system that powers CloudFront. It is, essentially, a self-configuring path through CloudFront but without caching enabled. The acceleration comes from optimized network stacks and keeping the traffic on the managed AWS network for much of its path across the Internet. As such, this feature should have no impact on consistency, if you enable and use it on a bucket... but, as I mentioned, how it interacts with us-east-1 buckets is not yet known.
The US-Standard (us-east-1) region is the oldest, and presumably largest, region of S3, and does play by some different rules than the other, newer regions.
An important and relevant difference is the consistency model.
Amazon S3 buckets in [all regions except US Standard] provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES. Amazon S3 buckets in the US Standard region provide eventual consistency.
http://aws.amazon.com/s3/faqs/
This is why I assumed you were using US Standard. The behavior you described is consistent with that design constraint.
You should be able to verify that this doesn't happen with a test bucket in another region... but, because data transfer from EC2 to S3 within the same region is free and very low latency, using a bucket in a different region may not be practical.
There is another option that is worth trying, has to do with the inner-workings of US-Standard.
US Standard is in fact geographically-distributed between Virginia and Oregon, and requests to "s3.amazonaws.com" are selectively routed via DNS to one location or another. This routing is largely a black box, but Amazon has exposed a workaround.
You can force your requests to be routed only to Northern Virginia by changing your endpoint from "s3.amazonaws.com" to "s3-external-1.amazonaws.com" ...
http://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region
... this is speculation on my part, but your issue may be exacerbated by geographic routing of your requests, and forcing them to "s3-external-1" (which, to be clear, is still US-Standard), might improve or eliminate your issue.
Update: The advice above has officially risen above speculation, but I'll leave it for historical reference. About a year I wrote the above, Amazon indeed announced that US-Standard does offer read-after-write consistency on new object creation, but only when the s3-external-1 endpoint is used. They explain it as though it's a new behavior, and that may be the case... but it also may simply be a change in the behavior the platform officially supports. Either way:
Starting [2015-06-19], the US Standard Region now supports read-after-write consistency for new objects added to Amazon S3 using the Northern Virginia endpoint (s3-external-1.amazonaws.com). With this change, all Amazon S3 Regions now support read-after-write consistency. Read-after-write consistency allows you to retrieve objects immediately after creation in Amazon S3. Prior to this change, Amazon S3 buckets in the US Standard Region provided eventual consistency for newly created objects, which meant that some small set of objects might not have been available to read immediately after new object upload. These occasional delays could complicate data processing workflows where applications need to read objects immediately after creating the objects. Please note that in US Standard Region, this consistency change applies to the Northern Virginia endpoint (s3-external-1.amazonaws.com). Customers using the global endpoint (s3.amazonaws.com) should switch to using the Northern Virginia endpoint (s3-external-1.amazonaws.com) in order to leverage the benefits of this read-after-write consistency in the US Standard Region. [emphasis added]
https://forums.aws.amazon.com/ann.jspa?annID=3112
If you are uploading a large number of files (hundreds per second), you might also be overwhelming S3's sharding mechanism. For very high numbers of uploads per second, it's important that your keys ("filenames") not be lexically sequential.
Depending on how Amazon handles DNS, you may also want to try another alternate variant of addressing your bucket if your code can handle it.
Buckets in US-Standard can be addressed either with http://mybucket.s3.amazonaws.com/key ... or http://s3.amazonaws.com/mybucket/key ... and the internal implementation of these two could, at least in theory, be different in a way that changes the behavior in a way that would be relevant to your issue.
As you noted, currently there is no guarantee or workaround eventual consistency directly from S3. In this talk from Netflix, the speaker mentions having seen a 7h (extremely rare IMHO) consistency delay. They even created a consistency layer on top of S3, s3mper ,that is open source and might help in your context.
Other than that, as #Michael - sqlbot suggested, us-standard dos not offer read-after-write consistency, and the observed consistency delays may be different there.
Related
I have been trying to reason why an S3 bucket name has to be globally unique. I came across the stackoverflow answer as well that says in order to resolve host header, bucket name got to be unique. However, my point is can't AWS direct the s3-region.amazonaws.com to region specific web server that can serve the bucket object from that region? That way the name could be globally unique only for a region. Meaning, the same bucket could be created in a different region. Please let me know if my understanding is completely wrong on how name resolution works or otherwise?
There is not, strictly speaking, a technical reason why the bucket namespace absolutely had to be global. In fact, it technically isn't quite as global as most people might assume, because S3 has three distinct partitions that are completely isolated from each other and do not share the same global bucket namespace across partition boundaries -- the partitions are aws (the global collection of regions most people know as "AWS"), aws-us-gov (US GovCloud), and aws-cn (the Beijing and Ningxia isolated regions).
So things could have been designed differently, with each region independent, but that is irrelevant now, because the global namespace is entrenched.
But why?
The specific reasons for the global namespace aren't publicly stated, but almost certainly have to do with the evolution of the service, backwards compatibility, and ease of adoption of new regions.
S3 is one of the oldest of the AWS services, older than even EC2. They almost certainly did not foresee how large it would become.
Originally, the namespace was global of necessity because there weren't multiple regions. S3 had a single logical region (called "US Standard" for a long time) that was in fact comprised of at least two physical regions, in or near us-east-1 and us-west-2. You didn't know or care which physical region each upload went to, because they replicated back and forth, transparently, and latency-based DNS resolution automatically gave you the endpoint with the lowest latency. Many users never knew this detail.
You could even explicitly override the automatic geo-routing of DNS amd upload to the east using the s3-external-1.amazonaws.com endpoint or to the west using the s3-external-2.amazonaws.com endpoint, but your object would shortly be accessible from either endpoint.
Up until this point, S3 did not offer immediate read-after-write consistency on new objects since that would be impractical in the primary/primary, circular replication environment that existed in earlier days.
Eventually, S3 launched in other AWS regions as they came online, but they designed it so that a bucket in any region could be accessed as ${bucket}.s3.amazonaws.com.
This used DNS to route the request to the correct region, based on the bucket name in the hostname, and S3 maintained the DNS mappings. *.s3.amazonaws.com was (and still is) a wildcard record that pointed everything to "S3 US Standard" but S3 would create a CNAME for your bucket that overrode the wildcard and pointed to the correct region, automatically, a few minutes after bucket creation. Until then, S3 would return a temporary HTTP redirect. This, obviously enough, requires a global bucket namespace. It still works for all but the newest regions.
But why did they do it that way? After all, at around the same time S3 also introduced endpoints in the style ${bucket}.s3-${region}.amazonaws.com ¹ that are actually wildcard DNS records: *.s3-${region}.amazonaws.com routes directly to the regional S3 endpoint for each S3 region, and is a responsive (but unusable) endpoint, even for nonexistent buckets. If you create a bucket in us-east-2 and send a request for that bucket to the eu-west-1 endpoint, S3 in eu-west-1 will throw an error, telling you that you need to send the request to us-east-2.
Also, around this time, they quietly dropped the whole east/west replication thing, and later renamed US Standard to what it really was at that point -- us-east-1. (Buttressing the "backwards compatibility" argument, s3-external-1 and s3-external-2 are still valid endpoints, but they both point to precisely the same place, in us-east-1.)
So why did the bucket namespace remain global? The only truly correct answer an outsider can give is "because that's what the decided to do."
But perhaps one factor was that AWS wanted to preserve compatibility with existing software that used ${bucket}.s3.amazonaws.com so that customers could deploy buckets in other regions without code changes. In the old days of Signature Version 2 (and earlier), the code that signed requests did not need to know the API endpoint region. Signature Version 4 requires knowledge of the endpoint region in order to generate a valid signature because the signing key is derived against the date, region, and service... but previously it wasn't like that, so you could just drop in a bucket name and client code needed no regional awareness -- or even awareness that S3 even had regions -- in order to work with a bucket in any region.
AWS is well-known for its practice of preserving backwards compatibility. They do this so consistently that occasionally some embarrassing design errors creep in and remain unfixed because to fix them would break running code.²
Another issue is virtual hosting of buckets. Back before HTTPS was accepted as non-optional, it was common to host ststic content by pointing your CNAME to the S3 endpoint. If you pointed www.example.com to S3, it would serve the content from a bucket with the exact name www.example.com. You can still do this, but it isn't useful any more since it doesn't support HTTPS. To host static S3 content with HTTPS, you use CloudFront in front of the bucket. Since CloudFront rewrites the Host header, the bucket name can be anything. You might be asking why you couldn't just point the www.example.com CNAME to the endpoint hostname of your bucket, but HTTP and DNS operate at very different layers and it simply doesn't work that way. (If you doubt this assertion, try pointing a CNAME from a domain that you control to www.google.com. You will not find that your domain serves the Google home page; instead, you'll be greeted with an error because the Google server will only see that it's received a request for www.example.com, and be oblivious to the fact that there was an intermediate CNAME pointing to it.) Virtual hosting of buckets requires either a global bucket namespace (so the Host header exactly matches the bucket) or an entirely separate mapping database of hostnames to bucket names... and why do that when you already have an established global namespace of buckets?
¹ Note that the - after s3 in these endpoints was eventually replaced by a much more logical . but these old endpoints still work.
² two examples that come to mind: (1) S3's incorrect omission of the Vary: Origin response header when a non-CORS request arrives at a CORS-enabled bucket (I have argued without success that this can be fixed without breaking anything, to no avail); (2) S3's blatantly incorrect handling of the symbol + in an object key, on the API, where the service interprets + as meaning %20 (space) so if you want a browser to download from a link to /foo+bar you have to upload it as /foo{space}bar.
You create an S3 bucket in a specific region only and objects stored in a bucket is only stored in that region itself. The data is neither replicated nor stored in different regions, unless you setup replication on a per bucket basis.
However. AWS S3 shares a global name space with all accounts. The name given to an S3 bucket should be unique
This requirement is designed to support globally unique DNS names for each bucket eg. http://bucketname.s3.amazonaws.com
Recently, S3 announces strong read-after-write consistency. I'm curious as to how one can program that. Doesn't it violate the CAP theorem?
In my mind, the simplest way is to wait for the replication to happen and then return, but that would result in performance degradation.
AWS says that there is no performance difference. How is this achieved?
Another thought is that amazon has a giant index table that keeps track of all S3 objects and where it is stored (triple replication I believe). And it will need to update this index at every PUT/DELTE. Is that technically feasible?
As indicated by Martin above, there is a link to Reddit which discusses this. The top response from u/ryeguy gave this answer:
If I had to guess, s3 synchronously writes to a cluster of storage nodes before returning success, and then asynchronously replicates it to other nodes for stronger durability and availability. There used to be a risk of reading from a node that didn't receive a file's change yet, which could give you an outdated file. Now they added logic so the lookup router is aware of how far an update is propagated and can avoid routing reads to stale replicas.
I just pulled all this out of my ass and have no idea how s3 is actually architected behind the scenes, but given the durability and availability guarantees and the fact that this change doesn't lower them, it must be something along these lines.
Better answers are welcome.
Our assumptions will not work in the Cloud systems. There are a lot of factors involved in the risk analysis process like availability, consistency, disaster recovery, backup mechanism, maintenance burden, charges, etc. Also, we only take reference of theorems while designing. we can create our own by merging multiple of them. So I would like to share the link provided by AWS which illustrates the process in detail.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html
When you create a cluster with consistent view enabled, Amazon EMR uses an Amazon DynamoDB database to store object metadata and track consistency with Amazon S3. You must grant EMRFS role with permissions to access DynamoDB. If consistent view determines that Amazon S3 is inconsistent during a file system operation, it retries that operation according to rules that you can define. By default, the DynamoDB database has 400 read capacity and 100 write capacity. You can configure read/write capacity settings depending on the number of objects that EMRFS tracks and the number of nodes concurrently using the metadata. You can also configure other database and operational parameters. Using consistent view incurs DynamoDB charges, which are typically small, in addition to the charges for Amazon EMR.
I have been trying to reason why an S3 bucket name has to be globally unique. I came across the stackoverflow answer as well that says in order to resolve host header, bucket name got to be unique. However, my point is can't AWS direct the s3-region.amazonaws.com to region specific web server that can serve the bucket object from that region? That way the name could be globally unique only for a region. Meaning, the same bucket could be created in a different region. Please let me know if my understanding is completely wrong on how name resolution works or otherwise?
There is not, strictly speaking, a technical reason why the bucket namespace absolutely had to be global. In fact, it technically isn't quite as global as most people might assume, because S3 has three distinct partitions that are completely isolated from each other and do not share the same global bucket namespace across partition boundaries -- the partitions are aws (the global collection of regions most people know as "AWS"), aws-us-gov (US GovCloud), and aws-cn (the Beijing and Ningxia isolated regions).
So things could have been designed differently, with each region independent, but that is irrelevant now, because the global namespace is entrenched.
But why?
The specific reasons for the global namespace aren't publicly stated, but almost certainly have to do with the evolution of the service, backwards compatibility, and ease of adoption of new regions.
S3 is one of the oldest of the AWS services, older than even EC2. They almost certainly did not foresee how large it would become.
Originally, the namespace was global of necessity because there weren't multiple regions. S3 had a single logical region (called "US Standard" for a long time) that was in fact comprised of at least two physical regions, in or near us-east-1 and us-west-2. You didn't know or care which physical region each upload went to, because they replicated back and forth, transparently, and latency-based DNS resolution automatically gave you the endpoint with the lowest latency. Many users never knew this detail.
You could even explicitly override the automatic geo-routing of DNS amd upload to the east using the s3-external-1.amazonaws.com endpoint or to the west using the s3-external-2.amazonaws.com endpoint, but your object would shortly be accessible from either endpoint.
Up until this point, S3 did not offer immediate read-after-write consistency on new objects since that would be impractical in the primary/primary, circular replication environment that existed in earlier days.
Eventually, S3 launched in other AWS regions as they came online, but they designed it so that a bucket in any region could be accessed as ${bucket}.s3.amazonaws.com.
This used DNS to route the request to the correct region, based on the bucket name in the hostname, and S3 maintained the DNS mappings. *.s3.amazonaws.com was (and still is) a wildcard record that pointed everything to "S3 US Standard" but S3 would create a CNAME for your bucket that overrode the wildcard and pointed to the correct region, automatically, a few minutes after bucket creation. Until then, S3 would return a temporary HTTP redirect. This, obviously enough, requires a global bucket namespace. It still works for all but the newest regions.
But why did they do it that way? After all, at around the same time S3 also introduced endpoints in the style ${bucket}.s3-${region}.amazonaws.com ¹ that are actually wildcard DNS records: *.s3-${region}.amazonaws.com routes directly to the regional S3 endpoint for each S3 region, and is a responsive (but unusable) endpoint, even for nonexistent buckets. If you create a bucket in us-east-2 and send a request for that bucket to the eu-west-1 endpoint, S3 in eu-west-1 will throw an error, telling you that you need to send the request to us-east-2.
Also, around this time, they quietly dropped the whole east/west replication thing, and later renamed US Standard to what it really was at that point -- us-east-1. (Buttressing the "backwards compatibility" argument, s3-external-1 and s3-external-2 are still valid endpoints, but they both point to precisely the same place, in us-east-1.)
So why did the bucket namespace remain global? The only truly correct answer an outsider can give is "because that's what the decided to do."
But perhaps one factor was that AWS wanted to preserve compatibility with existing software that used ${bucket}.s3.amazonaws.com so that customers could deploy buckets in other regions without code changes. In the old days of Signature Version 2 (and earlier), the code that signed requests did not need to know the API endpoint region. Signature Version 4 requires knowledge of the endpoint region in order to generate a valid signature because the signing key is derived against the date, region, and service... but previously it wasn't like that, so you could just drop in a bucket name and client code needed no regional awareness -- or even awareness that S3 even had regions -- in order to work with a bucket in any region.
AWS is well-known for its practice of preserving backwards compatibility. They do this so consistently that occasionally some embarrassing design errors creep in and remain unfixed because to fix them would break running code.²
Another issue is virtual hosting of buckets. Back before HTTPS was accepted as non-optional, it was common to host ststic content by pointing your CNAME to the S3 endpoint. If you pointed www.example.com to S3, it would serve the content from a bucket with the exact name www.example.com. You can still do this, but it isn't useful any more since it doesn't support HTTPS. To host static S3 content with HTTPS, you use CloudFront in front of the bucket. Since CloudFront rewrites the Host header, the bucket name can be anything. You might be asking why you couldn't just point the www.example.com CNAME to the endpoint hostname of your bucket, but HTTP and DNS operate at very different layers and it simply doesn't work that way. (If you doubt this assertion, try pointing a CNAME from a domain that you control to www.google.com. You will not find that your domain serves the Google home page; instead, you'll be greeted with an error because the Google server will only see that it's received a request for www.example.com, and be oblivious to the fact that there was an intermediate CNAME pointing to it.) Virtual hosting of buckets requires either a global bucket namespace (so the Host header exactly matches the bucket) or an entirely separate mapping database of hostnames to bucket names... and why do that when you already have an established global namespace of buckets?
¹ Note that the - after s3 in these endpoints was eventually replaced by a much more logical . but these old endpoints still work.
² two examples that come to mind: (1) S3's incorrect omission of the Vary: Origin response header when a non-CORS request arrives at a CORS-enabled bucket (I have argued without success that this can be fixed without breaking anything, to no avail); (2) S3's blatantly incorrect handling of the symbol + in an object key, on the API, where the service interprets + as meaning %20 (space) so if you want a browser to download from a link to /foo+bar you have to upload it as /foo{space}bar.
You create an S3 bucket in a specific region only and objects stored in a bucket is only stored in that region itself. The data is neither replicated nor stored in different regions, unless you setup replication on a per bucket basis.
However. AWS S3 shares a global name space with all accounts. The name given to an S3 bucket should be unique
This requirement is designed to support globally unique DNS names for each bucket eg. http://bucketname.s3.amazonaws.com
I am new to cloud computing and I was working on a research paper for s3 proactive replica checking. I have a few questions and I have tried and read many forums and research papers but I couldn't find answers anywhere or they may be too complicated for me to understand.
If I don't enable, cross regional replica for s3 storage, just created a new bucket, will AWS automatically create replicas for my storage anywhere?
Is there any Java code or tutorial available by which I can calculate the s3 replica checking time?
AWS has great documentation on their services so that's the place to start. This link should help: http://docs.aws.amazon.com/AmazonS3/latest/dev/DataDurability.html
To answer your first question, replication occurs automatically for all s3 objects in a given region and provides 11 9s durability unless you choose reduced redundancy storage.
Cross region replication is something you will have to enable and is not automatic. As for java code to test replication time, I'm not aware of any. However it seems you could do it fairly easily using the standard SDK and issue a PUT for an object and then time how long it takes to show up in the bucket of the region to which you have replicated it. I suspect that timing will depend on your origin and destination regions, but from my experience I can tell you even replicating from a US region to an Asia region is quite fast.
Can some one help me in understanding the S3 outage usecase here.
The probability of S3 outage is very less, but in case if this happens, what are the ways we can access data that sits in S3.
I know that there is one possibility, that is cross region replication, that works for new files, that I am going to put in my s3 bucket, if I enable it now. What happen to old files, I know if I go and upload all those historical files also to the other region, then it works.
Then again the same question, if both the regions went down, then what?
I am sure others would have thought of this. Any inputs on this.
From Protecting Data in Amazon S3:
Objects are redundantly stored on multiple devices across multiple facilities in an Amazon S3 region. To help better ensure data durability, Amazon S3 PUT and PUT Object copy operations synchronously store your data across multiple facilities before returning SUCCESS. Once the objects are stored, Amazon S3 maintains their durability by quickly detecting and repairing any lost redundancy.
...
Backed with the Amazon S3 Service Level Agreement
Designed to provide 99.999999999% durability and 99.99% availability of objects over a given year
Designed to sustain the concurrent loss of data in two facilities
So, if you're still not happy with all those statements, how can you access your data in an outage?
If your data is in only one region, and the region is not accessible, then your data is not accessible. Note, however, that an external network connectivity problem could prevent access to Amazon S3, yet Amazon S3 might still be accessible from Amazon EC2 instances in the same region.
Cross-region replication will copy your data to another Amazon S3 region. It requires versioning to be activated. To copy any files that exist prior to activating cross-region replication, use the sync command in the AWS Command-Line Utility (CLI), eg:
aws s3 sync s3://bucket1/folder s3://bucket2/folder
Each AWS region operates independently, so the possibility of multiple regions suffering outages would presumably be even less likely.
If you are feeling particularly paranoid, you could copy your data to another cloud provider (Azure, Google, Rackspace, etc). There are tools that can assist:
CloudBerry Cloud Migrator
AzureCopy
...and no doubt many more!