ListObjects operation's limit on AWS

I am going through the documentation of ListObjects function in AWS' go SDK.
(the same holds more or less for the actual API endpoint)
The docs say:
Returns some or all (up to 1,000) of the objects in a bucket.
What does this mean? If my bucket has 200,000 objects, will this API call not work?
This example uses ListObjectsPages (which calls ListObjects under the hood) and claims to list all objects.
What is the actual case here?

I am going through the documentation of ListObjects function in AWS' go SDK.
Use ListObjectsV2. It behaves more or less the same, but it's an updated version of ListObjects. It's not super common for AWS to update APIs, and when they do, it's usually for a good reason. They're great about backwards compatibility, which is why ListObjects still exists.
This example uses ListObjectsPages (which calls ListObjects under the hood) and claims to list all objects.
ListObjectsPages is a paginated equivalent of ListObjects, and ditto for the V2 versions which I'll describe below.
Many AWS API responses are paginated. AWS uses cursor pagination; this means responses include a cursor (ContinuationToken, in the case of ListObjectsV2). If more objects exist (IsTruncated in the response), a subsequent ListObjectsV2 request can supply that ContinuationToken to continue the listing where the first response left off.
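Here's a minimal sketch of that manual loop using the v1 Go SDK (the bucket name is a placeholder and error handling is reduced to a log call):

svc := s3.New(session.Must(session.NewSession())) // github.com/aws/aws-sdk-go/service/s3
input := &s3.ListObjectsV2Input{Bucket: aws.String("my-bucket")}
var keys []string
for {
    page, err := svc.ListObjectsV2(input)
    if err != nil {
        log.Fatal(err) // handle the error appropriately
    }
    for _, obj := range page.Contents {
        keys = append(keys, aws.StringValue(obj.Key))
    }
    if !aws.BoolValue(page.IsTruncated) {
        break // no more pages
    }
    // feed the cursor back in to pick up where this response left off
    input.ContinuationToken = page.NextContinuationToken
}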
ListObjectsV2Pages handles the iterative ListObjectsV2 requests for you so you don't have to handle the logic of ContinuationToken and IsTruncated. Instead, you provide a function that will be invoked for each "page" in the response.
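For example (a sketch reusing the svc client from above; returning true from the callback asks the SDK to fetch the next page):

err := svc.ListObjectsV2Pages(
    &s3.ListObjectsV2Input{Bucket: aws.String("my-bucket")},
    func(page *s3.ListObjectsV2Output, lastPage bool) bool {
        for _, obj := range page.Contents {
            fmt.Println(aws.StringValue(obj.Key))
        }
        return true // keep paging until lastPage is reached
    })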
So it's accurate to say ListObjectsV2Pages will list "all" the objects, but only because it makes multiple ListObjectsV2 calls behind the scenes, collecting more than one page of results.
Thus, ...Pages functions can be considered convenience functions. You should always use them when appropriate: they take away the pain of pagination, and pagination is critical to keeping potentially high-volume API responses manageable. In AWS, if pagination is supported, assume you need it - in typical cases, the first page of results is not guaranteed to contain any results, even if subsequent pages do.

The AWS Go SDK V2 gives us paginator types to help us manage S3's per-query item limits. ListObjectsV2Pages is gone; in its place we get ListObjectsV2Paginator, which deals with the pagination details that @Daniel Farrell mentioned.
The constructor accepts the same params as the list objects query (type ListObjectsV2Input). The paginator exposes two methods: HasMorePages() bool and NextPage(ctx) (*ListObjectsV2Output, error).
// assuming client is an *s3.Client (github.com/aws/aws-sdk-go-v2/service/s3)
p := s3.NewListObjectsV2Paginator(client, &s3.ListObjectsV2Input{Bucket: aws.String("my-bucket")})
var items []types.Object // types is github.com/aws/aws-sdk-go-v2/service/s3/types
for p.HasMorePages() {
    page, err := p.NextPage(ctx)
    if err != nil {
        log.Fatal(err) // or return the error to the caller
    }
    items = append(items, page.Contents...)
}

Related

AWS S3: Cost of listing all object versions

In the scenario of listing all versions of an object using its key as a prefix:
import boto3
bucket = 'bucket name'
key = 'key'
s3 = boto3.resource('s3')
versions = s3.Bucket(bucket).object_versions.filter(Prefix=key)
for version in versions:
    obj = version.get()
    print(obj.get('VersionId'), obj.get('ContentLength'), obj.get('LastModified'))
Do I get charged only for listing the objects that are matching the prefix?
If so, is each object/version listed treated as a separate list request?
No, each object/version listed is not treated as a separate list request. You're only paying for the API requests to S3 (at something like $0.005 per 1000 API requests). A single API request will return many (up to 1000) objects/versions that match the indicated prefix. The prefix filtering itself happens server-side in S3.
The way to get a handle on this is to understand that AWS SDK calls ultimately result in API requests to AWS service endpoints, e.g. the S3 API. What you need to do is work out how your SDK client requests map to the underlying API requests to determine what is likely happening.
If your request is a simple 'list objects in my bucket' case, the boto3 SDK is going to make one or more ListObjectsV2 API calls. I say "or more" because the SDK may need to make multiple API requests, since each request yields a maximum number of results (e.g. 1000 objects in a ListObjectsV2 response). If there are 2500 objects in the bucket, for example, then three ListObjectsV2 requests would need to be made to the S3 API.
If your request is 'list objects in my bucket with a given prefix', then you need to know what capabilities are present on the ListObjectsV2 API call. Importantly, prefix is one of its parameters. This is how you know that S3 itself is doing the filtering on your supplied prefix (where you have indicated .filter(Prefix=key) in your code). If this were not a feature of the underlying S3 API, then your SDK (boto3 etc.) would have to do the filtering on prefix itself, and that would be a much more expensive and vastly slower operation: the SDK would have to list all objects, potentially resulting in many more LIST requests, and filter them client-side. Note: the ListObjectVersions API is similar to ListObjectsV2 in this regard, and both support prefix.
Also, note that VersionId, Size, and LastModified are all attributes that appear in the ListObjectVersions response, so no further API requests are needed to fetch this information.
So, in your case, assuming that there are fewer than 1000 object versions that match your indicated prefix, I believe that this equates to one S3 API request to ListObjectVersions (and this is considered a LIST request rather than a GET request for billing afaik, even though it is a GET HTTP request to https://mybucket.s3.amazonaws.com/?versions under the covers).
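For comparison, here's a hedged sketch of that single LIST request using the v1 Go SDK (svc is an s3.S3 client as in the earlier examples; the bucket and prefix are placeholders). All three attributes come straight out of the one response:

out, err := svc.ListObjectVersions(&s3.ListObjectVersionsInput{
    Bucket: aws.String("my-bucket"),
    Prefix: aws.String("key"),
})
if err != nil {
    log.Fatal(err)
}
// One LIST request; no per-version GET requests needed for these fields.
for _, v := range out.Versions {
    fmt.Println(aws.StringValue(v.VersionId), aws.Int64Value(v.Size), aws.TimeValue(v.LastModified))
}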

Is Amazon S3's ListObjectsV2 self-consistent over multiple pages?

ListObjectsV2 can only return 1000 results, at which point you have to go back for another page.
Since Amazon S3 is now strongly consistent, and other updates can be happening to the bucket while I am listing its contents, is the second page going to be more results from the same point in time as the first page? Or is it going to reflect the state of the bucket at the point in time when the second page was requested?
For example, if I list a bucket, get the first page, delete a key which would have appeared on the second page, and then get the second page, will I still see the key that is now deleted?
Indeed, Amazon S3 is now strongly consistent. This means once you upload an object, all people that read that object are guaranteed to get the updated version of the object. This does not mean that two different API calls are guaranteed to be in the same "state". Notably, for downloads, there is a situation where one download can get parts of two versions of the object if it's updated while being downloaded. More details are available in this answer.
As for your question, the same basic rules apply: S3 is strongly consistent from one call to the next; once you make a change to the bucket or objects, any call after that update is guaranteed to get the updated data. This means that as you page through the list of objects, you will see the changes, as each API call gets the latest state:
import boto3
BUCKET='example-bucket'
PREFIX='so_question'
s3 = boto3.client('s3')
# Create a bunch of items
for i in range(3000):
    s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}/obj_{i:04d}", Body=b'')
args = {'Bucket': BUCKET, 'Prefix': PREFIX + "/",}
result = s3.list_objects_v2(**args)
# This shows objects 0 to 999
print([x['Key'] for x in result['Contents']])
# Delete an object
s3.delete_object(Bucket=BUCKET, Key=f"{PREFIX}/obj_{1100:04d}")
# Request the next "page" of items
args['ContinuationToken'] = result['NextContinuationToken']
result = s3.list_objects_v2(**args)
# This will not show object 1100; it shows objects 1000 through 2000
# (the page extends one key further because 1100 is gone)
print([x['Key'] for x in result['Contents']])
The flip side of this, given that there's no way to get a list of all objects in a bucket (assuming it has more than 1000 items) in one API call, is that there's no way I'm aware of to get a complete "snapshot" of the bucket at a point in time, unless you can ensure the bucket doesn't change while you're listing the objects, of course.

Is it always necessary to check isTruncated in S3 ListObjects / ListObjectsV2 responses?

S3's ListObjects and ListObjectsV2 API responses both include an IsTruncated response element, which (according to the V1 API docs)
Specifies whether (true) or not (false) all of the results were returned. If the number of results exceeds that specified by MaxKeys, all of the results might not be returned.
According to the Listing Objects Keys section of the S3 documentation:
As buckets can contain a virtually unlimited number of keys, the complete results of a list query can be extremely large. To manage large result sets, the Amazon S3 API supports pagination to split them into multiple responses. Each list keys response returns a page of up to 1,000 keys with an indicator indicating if the response is truncated. You send a series of list keys requests until you have received all the keys. AWS SDK wrapper libraries provide the same pagination.
Clearly we need to check isTruncated if there's a possibility that the listing could match more than 1000 keys. Similarly, if we explicitly set MaxKeys then we definitely need to check isTruncated if there's ever the possibility that a listing could match more than MaxKeys keys.
However, do we need to check isTruncated if we never expect there to be more than min(1000, MaxKeys) matching keys?
I think that the weakest possible interpretation of the S3 API docs is that S3 will return at most min(1000, MaxKeys) keys per listing call but technically can return fewer keys even if more matching keys exist and would fit in the response. For example, if there are 10 matching keys and MaxKeys == 1000 then it would be technically valid for S3 to return, say, 3 keys in the first API response and 7 in the second. (Technically I suppose it could even return zero keys and set isTruncated = true, but that behavior seems unlikely.)
With these weak semantics I think we always need to check isTruncated, even if we're listing what we expect to be a very small number of keys. As a corollary, any code which doesn't check isTruncated is (most likely) buggy.
In the past, I've observed this listing semantic from other AWS APIs (including the EC2 Reserved Instance Marketplace API).
Is this a correct interpretation of the S3 API semantics? Or does S3 actually guarantee (but not document) stronger semantics (e.g. "if more than MaxKeys keys match the listing then the listing will contain exactly MaxKeys keys")?
I'm especially interested in answers which cite official AWS sources (such as AWS forum responses, SDK issues, etc).
In my experience, it will always return the maximum number of values, which is as you state it: min(1000, MaxKeys).
So, if you know you will always have under 1000 results, you would not need to check isTruncated.
Mind you, it's fairly easy to construct a while loop to do so. (Probably easier than writing this question!)
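For instance, in Go (v1 SDK; a sketch with a hypothetical bucket and a deliberately small MaxKeys), the defensive loop costs only a few extra lines:

svc := s3.New(session.Must(session.NewSession()))
input := &s3.ListObjectsV2Input{
    Bucket:  aws.String("my-bucket"),
    MaxKeys: aws.Int64(100), // requested page size; S3 may still return fewer
}
var keys []string
for {
    page, err := svc.ListObjectsV2(input)
    if err != nil {
        log.Fatal(err)
    }
    for _, obj := range page.Contents {
        keys = append(keys, aws.StringValue(obj.Key))
    }
    if !aws.BoolValue(page.IsTruncated) {
        break // only stop once S3 says nothing is left
    }
    input.ContinuationToken = page.NextContinuationToken
}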

How to invalidate AWS APIGateway cache

We have a service which inserts certain values into DynamoDB. For the sake of this question, let's say it's a key:value pair, i.e., customer_id:customer_email. The inserts don't happen that frequently, and once the inserts are done, that specific key doesn't get updated.
What we have done is create a client library which, provided with customer_id will fetch customer_email from dynamodb.
Given that customer_id data is static, we were thinking of adding a cache to the table, but one thing we are not sure about is what will happen in the following use case:
client_1 uses our library to fetch customer_email for customer_id = 2.
The customer doesn't exist so API Gateway returns not found
APIGateway will cache this response
For any subsequent calls, this cached response will be sent
Now another system inserts customer_id = 2 with its email ID. This system doesn't know whether this response has been cached previously, and it doesn't even know that any other system has fetched this specific data. How can we invalidate the cache for this specific customer_id when it gets inserted into DynamoDB?
You can send a request to the API endpoint with a Cache-Control: max-age=0 header which will cause it to refresh.
This could open your application up to attack, as a bad actor could simply flood an expensive endpoint with lots of traffic and buckle your servers/database. In order to safeguard against that, it's best to use a signed request.
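A minimal sketch of the unsigned variant in Go (the endpoint URL is hypothetical; a production setup would SigV4-sign this request as described above, and API Gateway can be configured to require the execute-api:InvalidateCache permission before honoring it):

req, err := http.NewRequest("GET", "https://abc123.execute-api.us-east-1.amazonaws.com/prod/customers/2", nil)
if err != nil {
    log.Fatal(err)
}
// Ask API Gateway to invalidate/refresh the cached entry for this key
req.Header.Set("Cache-Control", "max-age=0")
resp, err := http.DefaultClient.Do(req)
if err != nil {
    log.Fatal(err)
}
defer resp.Body.Close()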
In case it's useful to people, here's .NET code to create the signed request:
https://gist.github.com/secretorange/905b4811300d7c96c71fa9c6d115ee24
We've built a Lambda which takes care of re-filling the cache with updated results. It's quite a manual process, with very little reusable code, but it works.
The Lambda is triggered by the application itself as needed. For example, in CRUD operations the Lambda is triggered upon successful execution of POST, PATCH and DELETE on a specific resource, in order to clear the cached general GET request (i.e. clear GET /books whenever POST /book succeeds).
Unfortunately, if you have a view with a server-side paginated table, you are going to face all sorts of issues, because invalidating /books is not enough: you may actually have /books?page=2, /books?page=3 and so on... a nightmare!
I believe APIG should allow for more granular control of cache entries; otherwise many use cases aren't covered. It would be enough if they allowed choosing a root cache group for each request, so that we could manage cache entries by group rather than by single request (which, imho, is also less common).
Did you look at this https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-caching.html ?
There is a way to invalidate the entire cache or a particular cache entry.

How to Access Object From Amazon s3 using getSignedUrl Operation

I'm able to generate a signed URL using the getSignedUrl method.
var url = s3.getSignedUrl('getObject', paramsurl);
Using this URL, can I access all the objects in S3? When I make an HTTP request, it only returns 1000 items in the XML response. How do I find the next set of objects and push them to a new array?
Your question appears to be related to how many S3 objects are returned in a single ListObjects call.
If so, when you call the underlying APIs that list AWS resources, the API will typically return, by default, 1000 items. It will also return a 'next token' that you can use in a subsequent call to the same API to return the next batch of items.
Sometimes, you can also specify a 'max items' or 'max keys' in your request to allow you to override the default of 1000.
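For example, with the v1 Go SDK (a sketch; svc is an s3.S3 client, my-bucket is a placeholder, and prev holds the previous response):

out, err := svc.ListObjectsV2(&s3.ListObjectsV2Input{
    Bucket:            aws.String("my-bucket"),
    MaxKeys:           aws.Int64(200),             // override the default of 1000
    ContinuationToken: prev.NextContinuationToken, // the 'next token' from the last response
})
if err == nil {
    fmt.Println(len(out.Contents), "keys in this batch")
}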
PS if you use the AWS SDKs then this batching of results is typically hidden from you.