When to use S3 API pagination - amazon-web-services

I'm using the boto3 client to access data stored in an Amazon S3 bucket. After reading the docs, I see that I can make a request with this code:
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket(TARGET_BUCKET)
for obj in bucket.objects.filter(Prefix=TARGET_KEYS + KEY_SEPARATOR):
    print(obj)
I tested against a bucket where I've stored 3000 objects, and this fragment of code retrieves references to all of them. I've read that every S3 API call returns at most 1000 entries.
But reading the paginators section of the boto3 documentation, I see that some S3 operations need to use pagination to retrieve all the results. I don't understand why the code above works unless it is using a paginator under the hood. So my question is: can I safely assume that the code above will always retrieve all the results?

According to the boto3 documentation on collections, pagination is handled for you.
A collection provides an iterable interface to a group of resources.
Collections behave similarly to Django QuerySets and expose a similar
API. A collection seamlessly handles pagination for you, making it
possible to easily iterate over all items from all pages of data.
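In other words, the collection is doing roughly what an explicit paginator would do. A minimal sketch of the equivalent explicit version, reusing the TARGET_BUCKET/TARGET_KEYS names from the question:
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
# each page corresponds to one ListObjectsV2 call returning up to 1000 objects
for page in paginator.paginate(Bucket=TARGET_BUCKET, Prefix=TARGET_KEYS + KEY_SEPARATOR):
    for obj in page.get('Contents', []):
        print(obj['Key'])
Either way, all pages are fetched, so the 3000-object bucket is fully listed.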

Related

AWS S3: Cost of listing all object versions

In the scenario of listing all versions of an object using its key as a prefix:
import boto3
bucket = 'bucket name'
key = 'key'
s3 = boto3.resource('s3')
versions = s3.Bucket(bucket).object_versions.filter(Prefix=key)
for version in versions:
    obj = version.get()
    print(obj.get('VersionId'), obj.get('ContentLength'), obj.get('LastModified'))
Do I get charged only for listing the objects that are matching the prefix?
If so, is each object/version listed treated as a separate list request?
No, each object/version listed is not treated as a separate list request. You're only paying for the API requests to S3 (at something like $0.005 per 1000 API requests). A single API request will return many (up to 1000) objects/versions that match the indicated prefix. The prefix filtering itself happens server-side in S3.
The way to get a handle on this is to understand that AWS SDK calls ultimately result in API requests to AWS service endpoints e.g. S3 APIs. What you need to do is work out how your SDK client requests map to the underlying API requests to determine what is likely happening.
If your request is a simple 'list objects in my bucket' case, the boto3 SDK is going to make one or more ListObjectsV2 API calls. I say "or more" because the SDK may need to make more than one API request because API requests typically yield a maximum number of results (e.g. 1000 objects in a ListObjectsV2 response). If there are 2500 objects in the bucket, for example, then three ListObjectsV2 requests would need to be made to the S3 API.
If your request is 'list objects in my bucket with a given prefix', then you need to know what capabilities are present on the ListObjectsV2 API call. Importantly, prefix is one of the parameters. This is how you know that S3 itself is doing the filtering on your supplied prefix (where you have indicated .filter(Prefix=key) in your code). If this were not a feature of the underlying S3 API, then your SDK (boto3 etc.) would be the one doing the filtering on prefix and that would be a much more expensive and vastly slower operation, because the SDK would have to list all objects, potentially resulting in many more LIST requests, and filter them client-side. Note: the ListObjectVersions API is similar to ListObjectsV2 in this regard and both support prefix.
Also, note that VersionId, Size, and LastModified are all attributes that appear in the ListObjectVersions response, so no further API requests are needed to fetch this information.
So, in your case, assuming that there are fewer than 1000 object versions that match your indicated prefix, I believe that this equates to one S3 API request to ListObjectVersions (and this is considered a LIST request rather than a GET request for billing afaik, even though it is a GET HTTP request to https://mybucket.s3.amazonaws.com/?versions under the covers).
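To make that concrete, here is a minimal sketch using the client directly; it shows that VersionId, Size, and LastModified all come back in the ListObjectVersions response itself (bucket and key are the variables from the question's code):
import boto3
client = boto3.client('s3')
paginator = client.get_paginator('list_object_versions')
# one LIST request per page of up to 1000 versions; no per-object GETs needed
for page in paginator.paginate(Bucket=bucket, Prefix=key):
    for version in page.get('Versions', []):
        print(version['VersionId'], version['Size'], version['LastModified'])
Unlike the version.get() call in the question's code, this reads everything from the list response, so no extra GET requests are issued.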

Storing S3 Urls vs calling listObjects

I have an app that has an attachments feature for users. They can upload documents to S3 and then revisit and preview and/or Download said attachments.
I was planning on storing the S3 URLs in the DB and then pre-signing them when the user needs them. The caveat I'm finding is that this can lead to edge cases between S3 and the DB.
I.e. if a file gets removed from S3 but its URL does not get removed from the DB (or vice versa), the data becomes inconsistent and may mislead users.
I was thinking of just getting the URLs over the network by using listObjects in the S3 client SDK. I don't really need to store the URLs, and this guarantees the user gets what's actually in S3.
The only con here is that it costs an API request (as opposed to a DB hit).
Any insights?
Thanks!
Using a database to store an index to files is a good idea, especially once the volume of objects increases. The ListObjects() API only returns 1000 objects per call. This might be okay if every user has their own path (so you can use ListObjects(Prefix='user1/')), but that's not ideal if you want to allow document sharing between users.
Using a database will definitely be faster to obtain a listing, and it has the advantage that you can filter on attributes and metadata.
The two systems will only get "out of sync" if objects are created/deleted outside of your app, or if there is an error in the app. If this concerns you, use Amazon S3 Inventory to provide a regular listing of objects in the bucket, and write some code to compare it against the database entries. This will highlight if anything is going wrong.
While Amazon S3 is an excellent NoSQL database (Key = filename, Value = contents), it isn't good for searching/listing a large quantity of objects.
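For what it's worth, a minimal sketch of the store-the-key-and-presign-on-demand approach; generate_presigned_url is real boto3, but fetch_attachment_keys and the bucket name are hypothetical stand-ins for your database lookup:
import boto3
s3 = boto3.client('s3')

def presign(bucket, key, expires=3600):
    # generating a pre-signed URL is a local operation; no S3 request is made
    return s3.generate_presigned_url(
        'get_object',
        Params={'Bucket': bucket, 'Key': key},
        ExpiresIn=expires,
    )

def fetch_attachment_keys(user_id):
    # hypothetical stand-in for your database index lookup
    return ['user1/report.pdf']

for key in fetch_attachment_keys(user_id='user1'):
    print(presign('my-attachments-bucket', key))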

Listing all files in a Google Cloud Storage bucket and its pricing

When using bucket.getFilesStream, which auto-paginates through the files in a bucket, is each page's worth of data considered a single Class A operation? Or is the entire paginated stream considered a single Class A operation?
If it's multiple operations, is there a cheaper way to get a list of all files in a bucket, assuming there are millions of files?
According to the official Cloud Storage JSON API reference, the method for listing bucket objects is storage.objects.list. It retrieves a list of objects matching the specified criteria, and it is the method the client libraries use to retrieve the list of objects in a bucket. As long as this is the only method to achieve this, there isn't any workaround to list the bucket's objects more cheaply.
As you can see in the Google Cloud Storage pricing documentation, a call to this method is counted as a Class A operation. The number of calls depends on how the Node.js client library uses the JSON API.
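A minimal sketch of the same accounting with the Python client library (the question uses Node.js, but the underlying objects.list calls are counted the same way; the bucket name is hypothetical):
from google.cloud import storage
client = storage.Client()
blobs = client.list_blobs('my-bucket')  # hypothetical bucket
class_a_ops = 0
for page in blobs.pages:  # each consumed page is one objects.list request
    class_a_ops += 1
    for blob in page:
        pass  # blob.name, blob.size, etc. are already in the list response
print(class_a_ops)  # for millions of files this will be in the thousands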

When to use a boto3 client and when to use a boto3 resource?

I am trying to understand when I should use a Resource and when I should use a Client.
The definitions provided in the boto3 docs don't really make it clear when it is preferable to use one or the other.
boto3.resource is a high-level service class that wraps boto3.client.
It is meant to bind to a concrete resource (a bucket, for example) so that you can later act on related resources without re-specifying the original resource's ID.
import boto3
s3 = boto3.resource("s3")
bucket = s3.Bucket('mybucket')
# now bucket is "attached" to the S3 bucket named "mybucket"
print(bucket)
# s3.Bucket(name='mybucket')
print(dir(bucket))
# shows all the actions you can perform on this resource
On the other hand, boto3.client is low-level: there is no "entry-class object", so you must explicitly specify the exact resource it connects to for every action you perform.
It depends on individual needs. However, boto3.resource doesn't wrap all of the boto3.client functionality, so sometimes you need to call boto3.client directly, or use boto3.resource.meta.client, to get the job done.
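For example, head_bucket has (as far as I know) no resource-level equivalent, so you reach it through meta.client; a minimal sketch with a hypothetical bucket name:
import boto3
from botocore.exceptions import ClientError
s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')  # hypothetical bucket
try:
    # head_bucket only exists on the client, so go through meta.client
    s3.meta.client.head_bucket(Bucket=bucket.name)
    print('bucket exists and is accessible')
except ClientError as err:
    print('no access:', err.response['Error']['Code'])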
If possible, use the client over the resource, especially when dealing with S3 object lists and then trying to get basic information on those objects themselves.
For 10,000 objects, the client calls S3 10,000/1000 = 10 times and gives you a lot of information on each object in each call.
The resource, I assume, calls S3 10,000 times (or maybe the same 10 as the client?), but if you take an object it returns and try to do something with it, that is probably another call to S3, making this about 20x slower than the client.
My test reveals the following results.
s3 = boto3.resource("s3")
s3bucket = s3.Bucket(myBucket)
s3obj_list = s3bucket.objects.filter(Prefix=key_prefix)
tmp_list = [s3obj.key for s3obj in s3obj_list]
# tmp_list = [s3obj for s3obj in s3obj_list] gives the same ~9 minute result
Getting a list of 150,000 files this way took ~9 minutes. If s3obj_list is indeed pulling 1000 files per call and buffering them, s3obj.key is probably not part of that and makes another call.
client = boto3.client("s3")
response = client.list_objects_v2(Bucket=bucket, Prefix=prefix)
keys = [obj["Key"] for obj in response.get("Contents", [])]
# loop while the listing is truncated, passing along the continuation token
while response.get("IsTruncated"):
    response = client.list_objects_v2(
        Bucket=bucket,
        Prefix=prefix,
        ContinuationToken=response["NextContinuationToken"],
    )
    keys.extend(obj["Key"] for obj in response.get("Contents", []))
The client took ~30 seconds to list the same 150,000 files.
I don't know if the resource buffers 1000 files at a time, but if it doesn't, that is a problem.
I also don't know if it is possible for the resource to buffer the information attached to each object, but that is another problem.
I also don't know if using pagination could make the client faster/easier to use.
Anyone who knows the answers to the 3 questions above, please share. I'd be very interested to know.
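On the third question: boto3 ships paginators that wrap exactly the ContinuationToken loop shown above, so they are easier to use and should cost the same number of underlying ListObjectsV2 calls (one per 1000 keys). A minimal sketch:
import boto3
client = boto3.client("s3")
paginator = client.get_paginator("list_objects_v2")
keys = []
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))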

python boto for aws s3, how to get sorted and limited files list in bucket?

If there are too many files in a bucket and I want to get only the 100 newest files, how can I get only that list?
s3.bucket.list doesn't seem to have that function. Does anybody know how to do this?
Please let me know. Thanks.
There is no way to do this type of filtering on the service side. The S3 API does not support it. You might be able to accomplish something like this by using prefixes in your object names. For example, if you named all of your objects using a pattern like this:
YYYYMMDD/<objectname>
20140618/foobar (as an example)
you could use the prefix parameter of the ListBucket request in S3 to return only the objects that were stored today. In boto, this would look like:
import boto
s3 = boto.connect_s3()
bucket = s3.get_bucket('mybucket')
for key in bucket.list(prefix='20140618'):
    # do something with the key object
    print(key.name)
You would still have to retrieve all of the objects with that prefix and then sort them locally based on their last_modified timestamp, but that would be much easier than listing all of the objects in the bucket and then sorting.
The other option would be to store metadata about the S3 objects in a database like DynamoDB and then query that database to find the objects to retrieve from S3.
You can find out more about hierarchical listing in S3 here.
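A minimal sketch of that DynamoDB option; the table name and the bucket/uploaded_at key schema are hypothetical, but with the upload time as the sort key a single Query returns the newest objects without listing the bucket at all:
import boto3
from boto3.dynamodb.conditions import Key
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('s3-object-index')  # hypothetical table
# hypothetical schema: partition key 'bucket', sort key 'uploaded_at' (ISO 8601)
response = table.query(
    KeyConditionExpression=Key('bucket').eq('mybucket'),
    ScanIndexForward=False,  # newest first
    Limit=100,               # only the 100 newest items
)
for item in response['Items']:
    print(item['key'], item['uploaded_at'])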
You can try this code; it worked for me.
import boto
import time
con = boto.connect_s3()
key_repo = []
bucket = con.get_bucket('<your bucket name>')
bucket_keys = bucket.get_all_keys()  # note: returns at most 1000 keys per call
for obj in bucket_keys:
    t = (obj.key, time.strptime(obj.last_modified[:19], "%Y-%m-%dT%H:%M:%S"))
    key_repo.append(t)
key_repo.sort(key=lambda item: item[1], reverse=True)
for key in key_repo[:10]:  # top 10 items in the list
    print(key[0], ' ', key[1])
PS: I am a beginner in Python, so the code might not be optimized. Feel free to edit the answer to provide better code.