Is Amazon S3's ListObjectsV2 self-consistent over multiple pages? - amazon-web-services

ListObjectsV2 can only return 1000 results, at which point you have to go back for another page.
Since Amazon S3 is now strongly consistent, and other updates can be happening to the bucket while I am listing its contents, is the second page going to be more results from the same point in time as the first page? Or is it going to reflect the state of the bucket at the point in time when the second page was requested?
For example, if I list a bucket, get the first page, delete a key which would have appeared on the second page, and then get the second page, will I still see the key that is now deleted?

Indeed, Amazon S3 is now strongly consistent. This means that once you upload an object, anyone who reads that object is guaranteed to get the updated version. It does not mean that two different API calls are guaranteed to see the same "state". Notably, for downloads, there is a situation where one download can get parts of two versions of the object if it is updated while being downloaded. More details are available in this answer.
As for your question, the same basic rules apply: S3 is strongly consistent from one call to the next. Once you make a change to the bucket or its objects, any call made after that update is guaranteed to see the updated data. This means that as you page through the list of objects, you will see changes as they happen, because each API call gets the latest state:
import boto3
BUCKET='example-bucket'
PREFIX='so_question'
s3 = boto3.client('s3')
# Create a bunch of items
for i in range(3000):
    s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}/obj_{i:04d}", Body=b'')
args = {'Bucket': BUCKET, 'Prefix': PREFIX + "/",}
result = s3.list_objects_v2(**args)
# This shows objects 0 to 999
print([x['Key'] for x in result['Contents']])
# Delete an object
s3.delete_object(Bucket=BUCKET, Key=f"{PREFIX}/obj_{1100:04d}")
# Request the next "page" of items
args['ContinuationToken'] = result['NextContinuationToken']
result = s3.list_objects_v2(**args)
# This shows objects 1000 to 2000, but not object 1100, which was deleted after the first page was fetched
print([x['Key'] for x in result['Contents']])
The flip side of this (and of the fact that there's no way to list all objects in a bucket with more than 1000 items in a single API call) is that there's no way I'm aware of to get a complete "snapshot" of the bucket at a single point in time, unless you can ensure the bucket doesn't change while you're listing its objects, of course.
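If the goal is simply to walk every key despite the 1000-object page limit, boto3's built-in paginator handles the continuation tokens for you. A minimal sketch, reusing the example bucket and prefix from above; each underlying call still reflects the bucket's state at the moment that call is made:
import boto3

s3 = boto3.client('s3')

# The paginator issues repeated list_objects_v2 calls, passing the
# continuation token along automatically.
paginator = s3.get_paginator('list_objects_v2')
keys = []
for page in paginator.paginate(Bucket='example-bucket', Prefix='so_question/'):
    for obj in page.get('Contents', []):
        keys.append(obj['Key'])

print(f"Listed {len(keys)} keys")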

Related

Storing S3 Urls vs calling listObjects

I have an app that has an attachments feature for users. They can upload documents to S3 and then revisit and preview and/or download said attachments.
I was planning on storing the S3 URLs in the DB and then pre-signing them when the user needs them. A caveat I'm finding here is that this can lead to edge cases between S3 and the DB.
I.e. if a file gets removed from S3 but its url does not get removed from DB (or vice-versa). This can lead to data inconsistency and may mislead users.
I was thinking of just getting the urls via the network by using listObjects in the s3 client SDK. I don't really need to store the urls and this guarantees the user gets what's actually in S3.
The only con here is that it requires an S3 API request (as opposed to a DB hit).
Any insights?
Thanks!
Using a database to store an index to files is a good idea, especially once the volume of objects increases. The ListObjects() API only returns 1000 objects per call. That might be okay if every user has their own path (so you can use ListObjects(Prefix='user1/')), but it's not ideal if you want to allow document sharing between users.
Using a database will definitely be faster to obtain a listing, and it has the advantage that you can filter on attributes and metadata.
The two systems will only get "out of sync" if objects are created/deleted outside of your app, or if there is an error in the app. If this concerns you, use Amazon S3 Inventory to produce a regular listing of the objects in the bucket, and write some code to compare it against the database entries. That will highlight if anything is going wrong.
While Amazon S3 is an excellent NoSQL database (Key = filename, Value = contents), it isn't good for searching/listing a large quantity of objects.
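As a rough illustration of the database-as-index approach (the bucket name and key below are hypothetical), the app stores only the object key per attachment and pre-signs it on demand, so no long-lived URL ever lives in the database:
import boto3

s3 = boto3.client('s3')
BUCKET = 'example-attachments-bucket'  # hypothetical bucket name

def presigned_url_for(key, expires=3600):
    # The key comes from your own attachments table; the URL is
    # generated at request time and expires after `expires` seconds.
    return s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': BUCKET, 'Key': key},
        ExpiresIn=expires,
    )

# e.g. a key fetched from the attachments table for the current user
url = presigned_url_for('user1/report.pdf')
print(url)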

Boto3, S3 check if keys exist

Right now I do know how to check if a single key exists within my S3 bucket using Boto 3:
res = s3.list_objects_v2(Bucket=record.bucket_name, Prefix='back.jpg', Delimiter='/')
for obj in res.get('Contents', []):
    print(obj)
However I'm wondering if it's possible to check if multiple keys exist within a single API call. It feels a bit of a waste to do 5+ requests for that.
You could either use head_object() to check whether a specific object exists, or retrieve the complete bucket listing using list_objects_v2() and then look through the returned list to check for multiple objects.
Please note that list_objects_v2() only returns 1000 objects at a time, so you might need several calls to retrieve a list of all objects.
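If it helps, here is a rough sketch of both approaches (bucket and key names are just examples): list the bucket once and check membership locally, or send one head_object() call per key:
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
BUCKET = 'example-bucket'                        # example name
wanted = {'back.jpg', 'front.jpg', 'side.jpg'}   # keys to check

# Option 1: one listing (plus pagination if needed), then a local check
paginator = s3.get_paginator('list_objects_v2')
existing = set()
for page in paginator.paginate(Bucket=BUCKET):
    existing.update(obj['Key'] for obj in page.get('Contents', []))
print(wanted & existing)   # keys that exist
print(wanted - existing)   # keys that are missing

# Option 2: one HEAD request per key
def exists(key):
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError as e:
        if e.response['Error']['Code'] == '404':
            return False
        raise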

Caveat in Amazon S3 Consistency

The caveat in Amazon S3's read-after-write consistency for PUTs of new objects is that if you make a HEAD or GET request to a key name before the object is created, then create the object shortly after that, a subsequent GET might not return the object due to eventual consistency.
Why is this? What issue does the first GET create? Is it because S3 might look for the object in other AZs and, in the meanwhile, a PUT is made for the same key? Is S3 returning the previous status (checked across AZs and not found)?
I'm not aware of any public documentation that explains the reason for this caveat.
A quick reminder of what the statement is on S3 consistency:
Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all Regions with one caveat. The caveat is that if you make a HEAD or GET request to a key name before the object is created, then create the object shortly after that, a subsequent GET might not return the object due to eventual consistency.
Here are some related, non-authoritative discussions:
consistency model caveat
explanation of S3 consistency model
The first of those two discussions speculates that the reason is that S3 may cache the 404 object not found response to the initial HEAD/GET request and consequently may return that cached result on the GET following an initial PUT until the PUT has fully propagated. But that's speculative.

caveat of the read-after-write consistency for PUTS of new objects in an S3 bucket

From https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html :
Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all regions with one caveat. The caveat is that if you make a HEAD or GET request to the key name (to find if the object exists) before creating the object, Amazon S3 provides eventual consistency for read-after-write.
I'm not sure if I understand the caveat correctly. Before creating the object: OK, I haven't yet created an object with the key K, therefore no object with the key K exists; I make a GET request to K... what does my request return, according to the explanation above?
I'm confused because the explanation tells about the eventual consistency for read-after-write. But there is no write so far.
Update 2020-12-02: This whole discussion is now outdated. Amazon S3 provides strong read-after-write consistency for PUTs and DELETEs of objects in your Amazon S3 bucket in all AWS Regions.
Update: I rewrote the answer after reading a comment in this blog post.
I believe this caveat is talking about this scenario:
client 1: GET key_a --> this could return an object, even though the request was sent earlier.
client 2: PUT key_a
This could happen if client 1's request reached a node later than client 2's PUT request.
This situation happens when you have a file to upload, but that file might already exist. So rather than overwrite the existing file, you do the following:
1. Try to GET the file. It doesn't exist, so you get a 404 with No such key.
2. PUT the file.
3. Try to GET the file immediately afterward (for whatever reason).
In this sequence, step #3 may or may not return the file. Eventually you can retrieve the file, but how long that takes from the time of upload depends on the internals of S3 (I could speculate on why that happens, but it would only be speculation).
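For completeness, a minimal sketch of that sequence with boto3 (bucket and key names are placeholders). Under today's strong consistency step 3 always returns the object, but under the old model it could briefly return a 404 because of the earlier negative lookup:
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
BUCKET, KEY = 'example-bucket', 'maybe-existing-key'   # placeholders

# Step 1: probe for the object before creating it
try:
    s3.head_object(Bucket=BUCKET, Key=KEY)
except ClientError as e:
    if e.response['Error']['Code'] == '404':
        # Step 2: not found, so upload it
        s3.put_object(Bucket=BUCKET, Key=KEY, Body=b'contents')

# Step 3: read it back immediately; historically this GET could miss,
# but S3 is now strongly consistent and will return the object.
obj = s3.get_object(Bucket=BUCKET, Key=KEY)
print(obj['ContentLength'])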

Is AWS S3 read guaranteed to return a newly created object?

I've been reading the docs regarding read-after-write consistency with AWS S3 but I'm still unsure about this.
If I write an object to S3 and after getting a successful response from my write operation, I immediately attempt to read it, is the read operation guaranteed to return the object?
In other words, is it possible that the read operation will fail because it can't find the object? Because the read happened too soon after the write?
I'm only talking about new PUTs here, not updates to existing objects.
Yes, it is guaranteed to return the object (only for new objects), with one caveat:
As per AWS documentation:
Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all regions with one caveat. The caveat is that if you make a HEAD or GET request to the key name (to find if the object exists) before creating the object, Amazon S3 provides eventual consistency for read-after-write.
Amazon S3 offers eventual consistency for overwrite PUTS and DELETES in all regions.
EDIT: credit to Michael - sqlbot; more on the HEAD (or) GET caveat:
If you send a GET or HEAD before the object exists, such as to check whether there's an object there before you upload, then the upload is not immediately consistent for read requests even after the upload is complete, because S3 has already made the only immediately consistent internal query it's going to make for that object, discovering, authoritatively, that there's no such key. The object creation becomes eventually consistent, since the creation has to "overwrite" the previous lookup that found nothing.
Based on the table provided in the link, "consistent reads" will never be stale.
The link above also has a nice example of how "read-after-write consistency" and "eventual consistency" work.
I would like to add this caution note to the answer to make things clearer:
Amazon S3 achieves high availability by replicating data across multiple servers within Amazon's data centers. If a PUT request is successful, your data is safely stored. However, information about the changes must replicate across Amazon S3, which can take some time, and so you might observe the following behaviors:
A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.
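To make that concrete, here is a small sketch of the write-then-list pattern the note describes (bucket and key names are examples). With today's strong consistency the new key appears in the listing immediately, whereas under the old model it could be temporarily absent:
import boto3

s3 = boto3.client('s3')
BUCKET = 'example-bucket'   # example name

s3.put_object(Bucket=BUCKET, Key='reports/new-report.csv', Body=b'a,b,c\n')

# List immediately after the write; under strong consistency the new
# key is guaranteed to appear in this response.
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix='reports/')
print([obj['Key'] for obj in resp.get('Contents', [])])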