Boto3, S3 check if keys exist - amazon-web-services

Right now I do know how to check if a single key exists within my S3 bucket using Boto 3:
res = s3.list_objects_v2(Bucket=record.bucket_name, Prefix='back.jpg', Delimiter='/')
for obj in res.get('Contents', []):
    print(obj)
However, I'm wondering whether it's possible to check if multiple keys exist within a single API call. It feels like a bit of a waste to make 5+ requests for that.

You could either use head_object() to check whether a specific object exists, or retrieve the complete bucket listing using list_objects_v2() and then look through the returned list to check for multiple objects.
Please note that list_objects_v2() returns at most 1000 objects per call, so several calls might be needed to retrieve the complete list.
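For instance, one list_objects_v2() call per 1000 objects can replace a head_object() request per key. A sketch, with bucket and key names hypothetical, keeping the pure set check separate from the API call:

```python
def missing_keys(wanted, existing):
    """Pure helper: which of the wanted keys are absent from the listing?"""
    return sorted(set(wanted) - set(existing))

def check_keys(bucket, wanted, prefix=""):
    # Lazy import so the helper above can be used even without boto3 installed.
    import boto3
    s3 = boto3.client("s3")
    existing = []
    # One request per 1000 objects under the prefix, instead of one
    # head_object() request per wanted key.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        existing.extend(obj["Key"] for obj in page.get("Contents", []))
    return missing_keys(wanted, existing)
```

If the keys share a long common prefix, passing it as Prefix keeps the listing small; for a handful of keys scattered across a huge bucket, individual head_object() calls may still be cheaper than listing everything.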

Related

Is Amazon S3's ListObjectsV2 self-consistent over multiple pages?

ListObjectsV2 can only return 1000 results, at which point you have to go back for another page.
Since Amazon S3 is now strongly consistent, and other updates can be happening to the bucket while I am listing its contents, is the second page going to be more results from the same point in time as the first page? Or is it going to reflect the state of the bucket at the point in time when the second page was requested?
For example, if I list a bucket, get the first page, delete a key which would have appeared on the second page, and then get the second page, will I still see the key that is now deleted?
Indeed, Amazon S3 is now strongly consistent. This means that once you upload an object, all readers of that object are guaranteed to get the updated version. It does not mean that two different API calls are guaranteed to see the same "state". Notably, for downloads, there is a situation where one download can receive parts of two versions of the object if it is updated while being downloaded. More details are available in this answer.
As for your question, the same basic rules apply: S3 is strongly consistent from one call to the next. Once you make a change to the bucket or its objects, any call after that update is guaranteed to see the updated data. This means that as you page through the list of objects, you will see the changes, since each API call gets the latest state:
import boto3
BUCKET='example-bucket'
PREFIX='so_question'
s3 = boto3.client('s3')
# Create a bunch of items
for i in range(3000):
    s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}/obj_{i:04d}", Body=b'')
args = {'Bucket': BUCKET, 'Prefix': PREFIX + "/",}
result = s3.list_objects_v2(**args)
# This shows objects 0 to 999
print([x['Key'] for x in result['Contents']])
# Delete an object
s3.delete_object(Bucket=BUCKET, Key=f"{PREFIX}/obj_{1100:04d}")
# Request the next "page" of items
args['ContinuationToken'] = result['NextContinuationToken']
result = s3.list_objects_v2(**args)
# This will not show object 1100, showing objects 1000 to 2000
print([x['Key'] for x in result['Contents']])
The flip side of this, and of the fact that there's no way to get a list of all objects in a bucket (assuming it has more than 1000 items) in one API call, is that there's no way I'm aware of to get a complete "snapshot" of the bucket at a single point in time, unless you can ensure the bucket doesn't change while you're listing its objects, of course.

boto3 find object by metadata or tag

Is it possible to search objects in S3 bucket by object's metadata or tag key/value? (without object name or etag)
I know about head_object() method (ref), but it requires a Key in its parameters.
It seems that the get_object() method is not a solution either - it takes the same argument set as head_object(), and offers nothing for filtering on metadata.
As I can see, neither get_* nor list_* methods provide any suitable filters. But I believe that such an opportunity should be in S3 API.
No. The ListObjects() API call does not accept search criteria.
You will need to retrieve a listing of all objects, then call head_object() to obtain metadata.
Alternatively, you could use Amazon S3 Inventory, which can provide a regular CSV file containing a list of all objects and their metadata. Your program could use this as a source of information rather than calling ListObjects().
If you require something that can do real-time searching of metadata, the common practice is to store such information in a database (eg DynamoDB, RDS, Elasticsearch) and then reference the database to identify the desired Amazon S3 objects.
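A sketch of the list-then-head approach described above (the bucket name and metadata key would be your own); note that it makes one head_object() call per object, which is exactly why real-time searching wants a database instead:

```python
def matches(metadata, key, value):
    """Pure helper: does this metadata dict contain the key/value pair?"""
    return metadata.get(key) == value

def find_by_metadata(bucket, meta_key, meta_value):
    # Lazy import; only the live scan needs boto3.
    import boto3
    s3 = boto3.client("s3")
    hits = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            # One extra request per object; this is why it doesn't scale.
            head = s3.head_object(Bucket=bucket, Key=obj["Key"])
            if matches(head.get("Metadata", {}), meta_key, meta_value):
                hits.append(obj["Key"])
    return hits
```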

How to build an index of S3 objects when data exceeds object metadata limit?

Building an index of S3 objects can be very useful to make them quickly searchable: the natural, most obvious way is to store additional data in the object metadata and use a Lambda to write to DynamoDB or RDS, as described here: https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/
However, this strategy is limited by the amount of data one can store in the object metadata, which is 2KB, as described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html. Suppose you need to build a system where, every time an object is uploaded to S3, you need to add some information not contained in the file or the object name to a database, and this data exceeds 2KB: you can't store it in the object metadata.
What are viable strategies to keep the bucket and the index updated?
Implement two chained API calls where each call is idempotent: if the second fails when the first succeeds, one can retry until success. But what happens if you perform a PUT of an identical object on S3 with versioning activated? Will S3 create a new version? If so, implementing idempotency requires a single writer to be active at any given time.
Use some sort of workflow engine to keep track of this two-step behaviour, such as AWS Step Functions. What are the gotchas with this solution?
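A minimal sketch of the first strategy, with all names assumed (a Lambda on s3:ObjectCreated events, a DynamoDB table called s3-index): keying the index item on bucket, key, and version ID makes the write idempotent, so retries overwrite the same item instead of duplicating it.

```python
def make_index_item(bucket, key, version_id, extra):
    """Pure helper: build the index item. The composite id means a retried
    write overwrites the same item rather than creating a duplicate."""
    return {
        "id": f"{bucket}/{key}@{version_id}",
        "bucket": bucket,
        "key": key,
        "extra": extra,  # the >2KB payload that didn't fit in object metadata
    }

def handler(event, context):
    # Hypothetical Lambda handler for an s3:ObjectCreated:* event.
    import boto3  # lazy import keeps the helper above testable without boto3
    table = boto3.resource("dynamodb").Table("s3-index")  # table name assumed
    for record in event["Records"]:
        s3info = record["s3"]
        item = make_index_item(
            s3info["bucket"]["name"],
            s3info["object"]["key"],
            s3info["object"].get("versionId", "null"),
            extra={},  # fetch or compute the real payload here
        )
        # put_item is an unconditional overwrite, hence idempotent per id.
        table.put_item(Item=item)
```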

When to use a boto3 client and when to use a boto3 resource?

I am trying to understand when I should use a Resource and when I should use a Client.
The definitions provided in boto3 docs don't really make it clear when it is preferable to use one or the other.
boto3.resource is a high-level service class that wraps boto3.client.
It is meant to attach you to a connected resource, so that you can later perform further actions without specifying the original resource ID each time.
import boto3
s3 = boto3.resource("s3")
bucket = s3.Bucket('mybucket')
# now bucket is "attached" to the S3 bucket named "mybucket"
print(bucket)
# s3.Bucket(name='mybucket')
print(dir(bucket))
# shows all of the actions you may perform on this object
On the other hand, boto3.client is low level: you don't have an "entry-class object", so you must explicitly specify the exact resource it connects to for every action you perform.
It depends on your individual needs. However, boto3.resource doesn't wrap all of the boto3.client functionality, so sometimes you need to call boto3.client directly, or use boto3.resource.meta.client, to get the job done.
If possible, use client over resource, especially when dealing with S3 object listings and then trying to get basic information on those objects themselves.
For 10,000 objects, client calls S3 10,000/1000 = 10 times and returns a lot of information on each object in each call.
Resource, I assume, calls S3 10,000 times (or maybe the same number as client?), but if you take one of those objects and try to do something with it, that is probably another call to S3, making this about 20x slower than client.
My test reveals the following results.
s3 = boto3.resource("s3")
s3bucket = s3.Bucket(myBucket)
s3obj_list = s3bucket.objects.filter(Prefix=key_prefix)
tmp_list = [s3obj.key for s3obj in s3obj_list]
(tmp_list = [s3obj for s3obj in s3obj_list] gives same ~9min results)
When trying to get a list of 150,000 files, it took ~9 minutes. If s3obj_list is indeed pulling 1000 files per call and buffering them, s3obj.key is probably not part of that buffer and makes another call.
response = client.list_objects_v2(Bucket=bucket, Prefix=prefix)
keys = [obj['Key'] for obj in response.get('Contents', [])]
while response.get('IsTruncated'):
    response = client.list_objects_v2(
        Bucket=bucket,
        Prefix=prefix,
        ContinuationToken=response['NextContinuationToken'],
    )
    keys.extend(obj['Key'] for obj in response.get('Contents', []))
Client took ~30 seconds to list the 150,000 files.
I don't know if resource buffers 1000 files at a time, but if it doesn't, that is a problem.
I also don't know whether resource can buffer the information attached to each object, but that is another question.
I also don't know if using pagination could make client faster or easier to use.
If anyone knows the answers to the three questions above, please share. I'd be very interested to know.
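On the pagination question: boto3's built-in paginator wraps the ContinuationToken loop for you. It issues the same number of list_objects_v2 calls as the hand-written loop, so it's a convenience rather than a speed-up. A sketch that takes the client as a parameter:

```python
def list_all_keys(s3, bucket, prefix=""):
    """List every key under a prefix.

    s3 is a boto3 S3 client, e.g. boto3.client('s3').
    """
    keys = []
    # The paginator issues list_objects_v2 calls and follows
    # NextContinuationToken until the listing is exhausted.
    for page in s3.get_paginator("list_objects_v2").paginate(
        Bucket=bucket, Prefix=prefix
    ):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```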

Get arbitrary object from Riak bucket

Is there a way to get a random object from a specific bucket by using Riak's HTTP API? Let's say that you have no knowledge about the contents of a bucket, the only thing you know is that all objects in a bucket share a common data structure. What would be a good way to get any object from a bucket, in order to show its data structure? Preferably using MapReduce over Search, since Search will flatten the response.
The best option is to use predictable keys so you don't have to find them. Since that is not always possible, secondary indexing is the next best.
If you are using eLevelDB, you can query the $BUCKET implicit index with max_results set to 1, which will return a single key. You would then issue a get request for that key.
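That query can be sketched against the HTTP API roughly as follows; the host, port, and exact path shape are assumptions, so check the HTTP API documentation for your Riak version:

```python
def one_key_query_url(base_url, bucket):
    # Query the implicit $bucket secondary index, capped at a single result.
    # Path shape is an assumption based on the Riak 1.4+ HTTP 2i interface.
    return f"{base_url}/buckets/{bucket}/index/$bucket/_?max_results=1"

# A GET on that URL returns JSON like {"keys": ["..."]}; you would then
# issue a normal object GET for the returned key, for example:
# import json, urllib.request
# url = one_key_query_url("http://127.0.0.1:8098", "mybucket")
# keys = json.load(urllib.request.urlopen(url))["keys"]
```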
If you are using Bitcask, you have 2 options:
list all of the keys in the bucket
Key listing in Bitcask will need to fold over every value in all buckets in order to return the list of keys in a single bucket. Effectively this means reading your entire dataset from disk, so this is very heavy on the system and could bring a production cluster to its knees.
MapReduce
MapReduce over a full bucket uses a similar query to key listing, so it is also very heavy on the system. Since the map phase function is executed separately for each object, if your map phase returns the object, every object in the bucket will be passed over the network to the node running the reduce phase. It is therefore more efficient (read: less disastrous) to have the map phase return just the key with no data, and have your reduce phase return the first item in the list; you then issue a get request for the object once you have the key name.
While it is technically possible to find a key in a given bucket when you have no information about the keys or the contents, if you designed your system to create a key named <<"schema">> or <<"sample">> that contains a sample object in each bucket, you could simply issue a get request for that key instead of searching, folding, or mapping.
If you are using Riak 2.x, then search (http://docs.basho.com/riak/latest/dev/using/search/) is recommended over MapReduce or 2i queries in most use cases, and it is available via the HTTP API.