Amazon S3 consistency model for new objects - amazon-web-services

Looking at https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html,
Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all regions with one caveat. The caveat is that if you make a HEAD or GET request to the key name (to find if the object exists) before creating the object, Amazon S3 provides eventual consistency for read-after-write.
My understanding from this is that if I create a new object and I haven't checked for its existence beforehand, it should be available immediately (e.g., show up in list requests).
But the above link also says
...you might observe the following behaviors:
A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.
which contradicts the first statement as it basically says read-after-write consistency is always eventual for PUTS.

I read it as:
Amazon guarantees that a read request (GET or HEAD) will succeed (read-after-write consistency) for any newly PUT object, assuming you haven't requested that key before
Amazon doesn't guarantee that a ListBucket request will be immediately consistent for any newly PUT object, but rather the new object will eventually show up in a ListBucket request (eventual consistency)

S3 is now strongly consistent. As the announcement puts it: all S3 GET, PUT, and LIST operations, as well as operations that change object tags, ACLs, or metadata, are now strongly consistent. What you write is what you will read, and the results of a LIST will be an accurate reflection of what’s in the bucket. This applies to all existing and new S3 objects, works in all regions, and is available to you at no extra charge! There’s no impact on performance, you can update an object hundreds of times per second if you’d like, and there are no global dependencies.
https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
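To see what this means in practice, here is a minimal boto3 sketch (the bucket name and key are placeholders) that writes a brand-new object and immediately reads and lists it; with strong consistency both requests reflect the write right away:

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-example-bucket"   # placeholder bucket name
key = "new-object.txt"         # placeholder key

# Write a brand-new object...
s3.put_object(Bucket=bucket, Key=key, Body=b"hello")

# ...and immediately read it back. With strong consistency this succeeds
# even if the key was HEAD/GET-requested before it existed.
obj = s3.get_object(Bucket=bucket, Key=key)
print(obj["Body"].read())  # b'hello'

# The new key also shows up in an immediate LIST of the bucket.
listing = s3.list_objects_v2(Bucket=bucket, Prefix=key)
print(any(item["Key"] == key for item in listing.get("Contents", [])))  # True
```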

Related

Remove old data from S3 based on last modified timestamp

I am working on a project dealing with images. It stores all the images in Amazon S3, does some editing, stores the edited images back in S3, and then uses the S3 URLs.
Now, there are a lot of images (>100000) and I need to find which images were last modified more than a year ago so that I can save on my S3 cost by removing those images.
Lifecycle Rules are the S3 feature that automatically transitions objects to cheaper storage classes, or deletes them, after a certain period of time.
You can create these on the bucket for specific prefixes and then choose an action for the objects that match the prefix. These actions are applied to the objects a configurable amount of time after they have been created or last modified.
Be aware that this happens asynchronously and not immediately, but usually within 48 hours if I recall correctly. Lifecycle rules have the benefit of being free.
Here's some more information:
Managing your storage lifecycle
Lifecycle configuration elements
You can specify lifecycle transitions and delete or move less frequently used objects/images to low cost storage. Please read https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-transition-general-considerations.html
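To make the lifecycle approach concrete, here is a hedged boto3 sketch (the bucket name, prefix, and 365-day window are placeholders) that expires objects under a prefix a year after they were created:

```python
import boto3

s3 = boto3.client("s3")

# Expire (delete) objects under the "images/" prefix 365 days after creation.
# Bucket name and prefix are placeholders; adjust to your setup.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-image-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-images",
                "Filter": {"Prefix": "images/"},
                "Status": "Enabled",
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```

A transition rule looks the same, with a "Transitions" element (e.g. to GLACIER or DEEP_ARCHIVE) instead of, or alongside, the "Expiration" element.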

AWS S3 batch operation - Got Dinged Pretty Hard

We used the newly introduced AWS S3 batch operations to back up our S3 bucket, which had about 15 TB of data, to S3 Glacier. Prior to backing up we had estimated the bandwidth and storage costs and also taken into account the mandatory 90-day minimum storage duration for Glacier.
However, the actual costs turned out to be massive compared to our estimate. We somehow overlooked the upload (PUT) request cost, which runs at $0.05 per 1,000 requests. We have many millions of files, each file upload counted as a request, and we are looking at several thousand dollars' worth of spend :(
I am wondering if there was any way to avoid this?
The concept of "backup" is quite interesting.
Traditionally, where data was stored on one disk, a backup was imperative because it's not good to have a single point-of-failure.
Amazon S3, however, stores data on multiple devices across multiple Availability Zones (effectively multiple data centers), which is how they get their 99.999999999% durability and 99.99% availability. (Note that durability means the likelihood of retaining the data, which isn't quite the same as availability which means the ability to access the data. I guess the difference is that during a power outage, the data might not be accessible, but it hasn't been lost.)
Therefore, the traditional concept of taking a backup in case of device failure has already been handled in S3, all for the standard cost. (There is an older Reduced Redundancy option that only copied to 2 AZs instead of 3, but that is no longer recommended.)
Next comes the concept of backup in case of accidental deletion of objects. When an object is deleted in S3, it is not recoverable. However, enabling versioning on a bucket will retain multiple versions including deleted objects. This is great where previous histories of objects need to be kept, or where deletions might need to be undone. The downside is that storage costs include all versions that are retained.
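For illustration, here is a minimal boto3 sketch of turning on versioning (the bucket name is a placeholder); after this, deleting an object only adds a delete marker and prior versions remain recoverable:

```python
import boto3

s3 = boto3.client("s3")

# Turn on versioning so that overwrites and deletes keep prior versions
# recoverable (bucket name is a placeholder).
s3.put_bucket_versioning(
    Bucket="my-example-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# A delete now only adds a delete marker; the previous version remains and
# can be restored by removing the marker or copying the old version back.
```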
There are also the newer Object Lock capabilities in S3, where objects can be locked for a period of time (eg 3 years) without the ability to delete them. This is ideal for situations where information must be retained for a period and it avoids accidental deletion. (There is also a legal hold capability that is similar, but can be turned on/off if you have appropriate permissions.)
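If you go the Object Lock route, a default retention rule can be set like this (a hedged sketch; the bucket name, mode, and 3-year period are placeholders, and Object Lock must have been enabled when the bucket was created):

```python
import boto3

s3 = boto3.client("s3")

# Apply a default retention period of 3 years in compliance mode, so object
# versions cannot be deleted until the retention period has passed.
s3.put_object_lock_configuration(
    Bucket="my-locked-bucket",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 3}},
    },
)
```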
Finally, there is the potential for deliberate malicious deletion if an angry staff member decides to take revenge on your company for not stocking their favourite flavour of coffee. If an AWS user has the necessary permissions, they can delete the data from S3. To guard against this, you should limit who has such permissions and possibly combine it with versioning (so they can delete the current version of an object, but it is actually retained by the system).
This can also be addressed by using Cross-Region Replication of Amazon S3 buckets. Some organizations use this to copy data to a bucket owned by a different AWS account, such that nobody has the ability to delete data from both accounts. This is closer to the concept of a true backup because the copy is kept separate (account-wise) from the original. The extra cost of storage is minimal compared to the potential costs if the data was lost. Plus, if you configure the replica bucket to use the Glacier Deep Archive storage class, the costs can be quite low.
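As a sketch of that setup (the role ARN, account ID, and bucket names are placeholders, and both buckets must have versioning enabled), a replication rule targeting Glacier Deep Archive in another account could look like this with boto3:

```python
import boto3

s3 = boto3.client("s3")

# Replicate everything to a bucket in another account, storing replicas in
# Glacier Deep Archive. Ownership is translated to the destination account
# so the source account cannot tamper with the copies.
s3.put_bucket_replication(
    Bucket="my-source-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111111111111:role/s3-replication-role",
        "Rules": [
            {
                "ID": "backup-to-other-account",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": "arn:aws:s3:::backup-bucket-in-other-account",
                    "Account": "222222222222",
                    "AccessControlTranslation": {"Owner": "Destination"},
                    "StorageClass": "DEEP_ARCHIVE",
                },
            }
        ],
    },
)
```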
Your copy to Glacier is another form of backup (and offers cheaper storage than S3 in the long term), but it would need to be updated on a regular basis to be a continuous backup (eg by using backup software that understands S3 and Glacier). The "5c per 1000 requests" cost means that it is better used for archives (eg large zip files) rather than many small files.
Bottom line: Your need for a backup might be as simple as turning on Versioning and limiting which users can totally delete an object (including all past versions) from the bucket. Or, create a bucket replica and store it in Glacier Deep Archive storage class.

Google Cloud Storage Transfer

We've been using the Google Cloud Storage Transfer service, and in our data source (AWS) a directory was accidentally deleted. We figured it would still be in the data sink; however, upon taking a look, it wasn't there despite versioning being on.
This leads us to believe that in Storage Transfer the deleteObjectsUniqueInSink option hard-deletes objects in the sink and removes them from the archive.
We've been unable to confirm this in the documentation.
Is GCS Transfer Service's deleteObjectsUniqueInSink parameter in the TransferSpec mutually exclusive with GCS's object versioning soft-delete?
When the deleteObjectsUniqueInSink option is enabled, Google Cloud Storage Transfer will:
1. List only the live versions of objects in source and destination buckets.
2. Copy any objects unique in the source to the destination bucket.
3. Issue a versioned delete for any unique objects in the destination bucket.
If the unique object is still live at the time that Google Cloud Storage Transfer issues the deletion, it will be archived. If another process, such as Object Lifecycle Management, archived the object before the deletion occurs, the object could be permanently deleted at this point rather than archived.
Edit: Specifying the version in the delete results in a hard delete (Objects Delete Documentation), so transfer service is currently performing hard deletes for unique objects. We will update the service to instead perform soft deletions.
Edit: The behavior has been changed. From now on deletions in versioned buckets will be soft deletes rather than hard deletes.
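For anyone wanting to see the difference between the two delete types from client code, here is an illustrative sketch using the google-cloud-storage Python client (bucket and object names are placeholders): an unversioned delete in a versioning-enabled bucket archives the live object, while a generation-specific delete removes that version permanently:

```python
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-versioned-sink-bucket")  # placeholder name

# Unversioned (soft) delete: in a bucket with versioning enabled, the live
# object becomes a noncurrent (archived) version that can still be recovered.
bucket.delete_blob("path/to/object")

# Versioned (hard) delete: specifying the generation removes that exact
# version permanently.
blob = bucket.get_blob("path/to/other-object")
if blob is not None:
    bucket.delete_blob(blob.name, generation=blob.generation)
```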

Amazon S3 and Cloud Front

How do I integrate Amazon Cloud Front and S3 in a photo sharing application?
I currently upload to S3 and return the CloudFront URL, but this has not been very successful because there appears to be latency between S3 and CloudFront such that the returned URL is not immediately valid.
Does anyone know how I can work around this?
Facebook uses Akamai and if I upload an image it is immediately available.
Would appreciate some ideas on this.
You must be trying to fetch the object immediately through CloudFront. I'm unsure why that might be necessary, but you are hitting the limits of S3's eventual consistency model.
When you upload an object, the write takes a tiny amount of time to propagate across the S3 service. Generally this is well under one second and is hard to detect. (In a previous job, we found we could reasonably guarantee all files arrived within 10 seconds, and 99.9% within 1 second.)
Here's the official word from AWS; it's worth reading the whole page:
A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.
There's a much longer discussion on this stackoverflow question; assuming you are using a bucket in the us-standard (US East) region, you need to change your endpoint slightly to take advantage of the read-after-write model.
Further reading:
* Instrumental: Why you should stop using the us-standard Region in S3. Right Now™
* Read-After-Write Consistency in Amazon S3 (from 2009, contains dated info)
One way you can debug/prove this is by calling getObjectMetadata right before your CloudFront call. It should fail in this case.
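A hedged sketch of that check with boto3 (the bucket, key, and CloudFront domain are placeholders): HEAD the object first and only return the CloudFront URL once S3 reports that it exists:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

def object_ready(bucket: str, key: str) -> bool:
    """Return True once S3 reports the freshly uploaded object as existing."""
    try:
        # SDK equivalent of getObjectMetadata
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise

# Only hand out the CloudFront URL after the HEAD check succeeds.
if object_ready("my-photo-bucket", "uploads/photo.jpg"):
    url = "https://d1234example.cloudfront.net/uploads/photo.jpg"
```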

Recursively deleting versioned objects in Amazon S3

According to the documentation, objects in a versioned bucket can only be deleted permanently by also supplying their version ID.
I had a look at Python's Boto and it seems simple enough for small sets of objects. But if I have a folder that contains 100 000 objects, it would have to delete them one by one and that would take some time.
Is there a better way to go about this?
An easy way to delete versioned objects in an Amazon S3 bucket is to create a lifecycle rule. The rule activates on a batch basis (midnight UTC?), can delete objects within specified paths, and knows how to handle versioned objects.
See:
Lifecycle Configuration for a Bucket with Versioning
Such deletions do not count towards the API call usage count, so it can be cheaper, too!
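As a hedged example of such a rule with boto3 (the bucket name, prefix, and day counts are placeholders), this expires the current versions under a prefix and then permanently removes the noncurrent versions:

```python
import boto3

s3 = boto3.client("s3")

# Expire current versions under the prefix, permanently remove noncurrent
# versions shortly afterwards, and clean up incomplete multipart uploads.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-versioned-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "purge-folder",
                "Filter": {"Prefix": "folder-to-delete/"},
                "Status": "Enabled",
                "Expiration": {"Days": 1},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 1},
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 1},
            }
        ]
    },
)
```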