AWS S3: Cost of listing all object versions

In the scenario of listing all versions of an object using its key as a prefix:
import boto3
bucket = 'bucket name'
key = 'key'
s3 = boto3.resource('s3')
versions = s3.Bucket(bucket).object_versions.filter(Prefix=key)
for version in versions:
    obj = version.get()
    print(obj.get('VersionId'), obj.get('ContentLength'), obj.get('LastModified'))
Do I get charged only for listing the objects that are matching the prefix?
If so, is each object/version listed treated as a separate list request?

No, each object/version listed is not treated as a separate list request. You're only paying for the API requests to S3 (at something like $0.005 per 1000 API requests). A single API request will return many (up to 1000) objects/versions that match the indicated prefix. The prefix filtering itself happens server-side in S3.
The way to get a handle on this is to understand that AWS SDK calls ultimately result in API requests to AWS service endpoints e.g. S3 APIs. What you need to do is work out how your SDK client requests map to the underlying API requests to determine what is likely happening.
If your request is a simple 'list objects in my bucket' case, the boto3 SDK is going to make one or more ListObjectsV2 API calls. I say "or more" because API requests typically yield a maximum number of results (e.g. 1000 objects in a ListObjectsV2 response), so the SDK may need to make more than one request. If there are 2500 objects in the bucket, for example, then three ListObjectsV2 requests would need to be made to the S3 API.
If your request is 'list objects in my bucket with a given prefix', then you need to know what capabilities are present on the ListObjectsV2 API call. Importantly, prefix is one of the parameters. This is how you know that S3 itself is doing the filtering on your supplied prefix (where you have indicated .filter(Prefix=key) in your code). If this were not a feature of the underlying S3 API, then your SDK (boto3 etc.) would be the one doing the filtering on prefix and that would be a much more expensive and vastly slower operation, because the SDK would have to list all objects, potentially resulting in many more LIST requests, and filter them client-side. Note: the ListObjectVersions API is similar to ListObjectsV2 in this regard and both support prefix.
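To make that concrete, here is a minimal boto3 sketch that calls the underlying API directly via a paginator; the bucket name and prefix are placeholders:
import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# Prefix is passed straight through to the ListObjectsV2 API, so the filtering
# happens server-side; each page corresponds to one LIST request returning up
# to 1000 keys.
for page in paginator.paginate(Bucket='my-bucket', Prefix='my/key'):
    for obj in page.get('Contents', []):
        print(obj['Key'], obj['Size'])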
Also, note that VersionId, Size, and LastModified are all attributes that appear in the ListObjectVersions response, so no further API requests are needed to fetch this information.
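In boto3 terms, those values are already populated on the ObjectVersion resources returned by the listing, so you can read them without any per-object GetObject calls; a minimal sketch, with placeholder bucket name and prefix:
import boto3

s3 = boto3.resource('s3')
for v in s3.Bucket('my-bucket').object_versions.filter(Prefix='my/key'):
    # version_id, size and last_modified are filled in from the
    # ListObjectVersions response itself, so no extra GetObject requests.
    print(v.version_id, v.size, v.last_modified)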
So, in your case, assuming that there are fewer than 1000 object versions that match your indicated prefix, I believe that this equates to one S3 API request to ListObjectVersions (and this is considered a LIST request rather than a GET request for billing afaik, even though it is a GET HTTP request to https://mybucket.s3.amazonaws.com/?versions under the covers).

Related

Can I load data directly from a S3 Bucket for detecting key phrases in the AWS SDK for Java?

I want to perform Key Phrase detection using AWS Comprehend.
Is there any way to load data directly from an S3 URI instead of manually loading data from S3 and passing it to the SDK?
Yes.
For Amazon Comprehend, there are usually 3 ways to do the same action:
Synchronous action for one document e.g. DetectKeyPhrases
Synchronous action for multiple documents e.g. BatchDetectKeyPhrases
Asynchronous action for multiple documents e.g. StartKeyPhrasesDetectionJob
Most, if not all, of the time, the synchronous actions take in Text or TextList directly, and the asynchronous operations allow you to specify an S3 URI.
For detecting key phrases, this would be StartKeyPhrasesDetectionJob, which takes an S3Uri for the input data as well as for the output data.
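For illustration only, here is what that job submission looks like in Python (boto3); the request shape is essentially the same in the Java SDK, and the bucket URIs and role ARN below are placeholders:
import boto3

comprehend = boto3.client('comprehend')

# The job reads documents directly from S3 and writes the detected key
# phrases back to S3; no manual download/upload is needed.
response = comprehend.start_key_phrases_detection_job(
    InputDataConfig={
        'S3Uri': 's3://my-input-bucket/documents/',
        'InputFormat': 'ONE_DOC_PER_FILE',
    },
    OutputDataConfig={'S3Uri': 's3://my-output-bucket/results/'},
    DataAccessRoleArn='arn:aws:iam::123456789012:role/ComprehendS3Access',
    LanguageCode='en',
)
print(response['JobId'])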
All of these operations are available in the AWS SDK for Java v2 so feel free to refer to the SDK documentation for getting started.

ListObjects operation's limit on AWS

I am going through the documentation of ListObjects function in AWS' go SDK.
(the same holds more or less for the actual API endpoint)
So the docs write:
Returns some or all (up to 1,000) of the objects in a bucket.
What does this mean? If my bucket has 200,000 objects, will this API call not work?
This example uses ListObjectsPages (which calls ListObjects under the hood) and claims to list all objects.
What is the actual case here?
I am going through the documentation of ListObjects function in AWS' go SDK.
Use ListObjectsV2. It behaves more or less the same, but it's an updated version of ListObjects. It's not super common for AWS to update APIs, and when they do, it's usually for a good reason. They're great about backwards compatibility which is why ListObjects still exists.
This example uses ListObjectsPages (which calls ListObjects under the hood) and claims to list all objects.
ListObjectsPages is a paginated equivalent of ListObjects, and ditto for the V2 versions which I'll describe below.
Many AWS API responses are paginated. AWS uses cursor pagination; this means responses include a cursor (NextContinuationToken in the case of ListObjectsV2). If more objects exist (IsTruncated in the response), a subsequent ListObjectsV2 request can pass that token as ContinuationToken to continue the listing where the previous response left off.
ListObjectsV2Pages handles the iterative ListObjectsV2 requests for you so you don't have to handle the logic of ContinuationToken and IsTruncated. Instead, you provide a function that will be invoked for each "page" in the response.
So it's accurate to say ListObjectsV2Pages will list "all" the objects, but it's because it makes multiple ListObjectsV2 calls in the backend that it will list more than one page of responses.
Thus, ...Pages functions can be considered convenience functions. You should always use them when appropriate - they take away the pain of pagination, and pagination is critical to keeping potentially high-volume API responses manageable. In AWS, if pagination is supported, assume you need it - in typical cases, the first page of results is not guaranteed to contain any results, even if subsequent pages do.
The AWS Go SDK V2 gives us paginator types to help us manage S3's per-query item limits. ListObjectsV2Pages is gone. In its place we get ListObjectsV2Paginator, which deals with the pagination details that @Daniel_Farrell mentioned.
The constructor accepts the same params as the list objects query (type ListObjectsV2Input). The paginator exposes 2 methods: HasMorePages: bool and NextPage: (*ListObjectsV2Output, error).
// p is a *s3.ListObjectsV2Paginator; types is the s3 service's types package.
var items []types.Object
for p.HasMorePages() {
    // Each NextPage call issues one ListObjectsV2 request.
    page, err := p.NextPage(ctx)
    if err != nil {
        // handle the error (return it, log it, etc.)
        break
    }
    items = append(items, page.Contents...)
}

Storing S3 Urls vs calling listObjects

I have an app that has an attachments feature for users. They can upload documents to S3 and then revisit, preview, and/or download said attachments.
I was planning on storing the S3 URLs in the DB and then pre-signing them when the user needs them. A caveat I'm finding is that this can lead to edge cases where S3 and the DB get out of sync.
I.e. if a file gets removed from S3 but its url does not get removed from DB (or vice-versa). This can lead to data inconsistency and may mislead users.
I was thinking of just getting the urls via the network by using listObjects in the s3 client SDK. I don't really need to store the urls and this guarantees the user gets what's actually in S3.
The only con here is that it makes an API request (as opposed to a DB hit).
Any insights?
Thanks!
Using a database to store an index to files is a good idea, especially once the volume of objects increases. The ListObjects() API only returns up to 1000 objects per call. This might be okay if every user has their own path (so you can use ListObjects(Prefix='user1/')), but that's not ideal if you want to allow document sharing between users.
Using a database will definitely be faster to obtain a listing, and it has the advantage that you can filter on attributes and metadata.
The two systems will only get "out of sync" if objects are created/deleted outside of your app, or if there is an error in the app. If this concerns you, then use Amazon S3 Inventory to provide a regular listing of objects in the bucket, and write some code to compare it against the database entries. This will highlight if anything is going wrong.
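As a rough sketch of that comparison step in Python (boto3), listing the bucket directly here for simplicity (in practice you would read the S3 Inventory report instead); get_db_keys() is a hypothetical helper that returns the keys your database knows about:
import boto3

def get_db_keys():
    # Hypothetical helper: return the set of S3 keys recorded in the database.
    return set()

s3 = boto3.client('s3')
s3_keys = set()
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='my-attachments-bucket'):
    s3_keys.update(obj['Key'] for obj in page.get('Contents', []))

db_keys = get_db_keys()
print('In DB but missing from S3:', db_keys - s3_keys)
print('In S3 but missing from DB:', s3_keys - db_keys)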
While Amazon S3 is an excellent NoSQL database (Key = filename, Value = contents), it isn't good for searching/listing a large quantity of objects.

Google Cloud CDN started ignoring query strings for storage buckets

Some months ago we activated Cloud CDN for our storage buckets. Our storage data is regularly changed via a backend, so to invalidate the cached version we added a query param with the changedDate to the URL that is served to the client.
Back then this worked well.
Sometime in the last months (probably weeks), Google seems to have changed this and is now ignoring the query string when caching from storage buckets.
First part: Does anyone know why this was changed and why no one was notified about it?
Second part: How can you invalidate the cache for a particular object in a storage bucket without sending a cache-invalidation request (which you shouldn't) every time?
I don't like the idea of deleting the old file and uploading a new file with a changed filename every time something is uploaded...
EDIT:
for clarification: the official documentation (cloud.google.com/cdn/docs/caching) already states that they now ignore query strings for storage buckets:
For backend buckets, the cache key consists of the URI without the query string. Thus https://example.com/images/cat.jpg, https://example.com/images/cat.jpg?user=user1, and https://example.com/images/cat.jpg?user=user2 are equivalent.
We were affected by this also. After contacting Google Support, they have confirmed this is a permanent change. The recommended work around is to either use versioning in the object name, or use cache invalidation. The latter sounds a bit odd as the cache invalidation documentation states:
Invalidation is intended for use in exceptional circumstances, not as part of your normal workflow.
For backend buckets, the cache key consists of the URI without the query string, as the official documentation states. The bucket is not evaluating the query string, but the CDN should still do that. I could reproduce this same scenario, and it is currently still possible to use a query string as a cache buster.
Seems like the reason for the change is that the old behavior resulted in lost caching opportunities, higher costs and higher latency. The only recommended workaround for now is to create the new objects by incorporating the version into the object's name (which seems is not valid options for your case), or using cache invalidation.
Invalidating the cache for a particular object would require an explicit invalidation request. Maybe a Cache-Control header allowing such objects to be cached only for a certain time could be your workaround: the Cloud CDN cache has an expiration time defined by the "Cache-Control: s-maxage", "Cache-Control: max-age", and/or Expires headers.
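For example, a minimal sketch using the google-cloud-storage Python client to set such a Cache-Control header on an object (bucket and object names are placeholders):
from google.cloud import storage

client = storage.Client()
blob = client.bucket('my-bucket').blob('images/cat.jpg')

# Let Cloud CDN serve the cached copy for at most 5 minutes before it has to
# revalidate or refetch the object from the bucket.
blob.cache_control = 'public, max-age=300'
blob.patch()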
According to the docs, when using a backend bucket as the origin for Cloud CDN, query strings in the request URL are not included in the cache key:
For backend buckets, the cache key consists of the URI without the protocol, host, or query string.
Maybe using the query string to identify different versions of cached content is not the best practice promoted by GCP, but for legacy reasons it sometimes has to be done.
So, one way to work around this is to make the backend bucket a static website (do NOT enable CDN here), then use a custom origin (Cloud CDN backed by an internet network endpoint group backend service) that points to that static website.
For backend services, the query string IS part of the cache key:
For backend services, Cloud CDN defaults to using the complete request URI as the cache key
That's it. Yes, it is tedious, but it works!

How to Access Object From Amazon s3 using getSignedUrl Operation

I'm able to generate a signed URL using the getSignedUrl method.
var url = s3.getSignedUrl('getObject', paramsurl);
Using this URL, can I access the full object from S3? I make an HTTP request but it only returns 1000 items in the XML response. How do I find the next set of objects and push them to a new array?
Your question appears to be related to how many S3 objects are returned in a single ListObjects call.
If so, when you call the underlying APIs that list AWS resources, the API will typically return, by default, 1000 items. It will also return a 'next token' that you can use in a subsequent call to the same API to return the next batch of items.
Sometimes, you can also specify a 'max items' or 'max keys' in your request to allow you to override the default of 1000.
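As an illustration of that loop in Python (boto3), with a placeholder bucket name:
import boto3

s3 = boto3.client('s3')
keys = []
kwargs = {'Bucket': 'my-bucket', 'MaxKeys': 1000}
while True:
    resp = s3.list_objects_v2(**kwargs)
    keys.extend(obj['Key'] for obj in resp.get('Contents', []))
    token = resp.get('NextContinuationToken')
    if not token:
        # no 'next token' means this was the last batch
        break
    kwargs['ContinuationToken'] = token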
PS if you use the AWS SDKs then this batching of results is typically hidden from you.