When to use a boto3 client and when to use a boto3 resource? - amazon-web-services

I am trying to understand when I should use a Resource and when I should use a Client.
The definitions provided in boto3 docs don't really make it clear when it is preferable to use one or the other.

boto3.resource is a high-level service class that wraps boto3.client.
It is meant to attach to a connected resource, so that later operations on sub-resources don't require you to repeat the original resource id.
import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket('mybucket')
# now bucket is "attached" to the S3 bucket named "mybucket"
print(bucket)
# s3.Bucket(name='mybucket')
print(dir(bucket))
# shows all the actions (methods) you can perform on the bucket
OTOH, boto3.client is low level: there is no "entry-class object", so you must explicitly specify the exact resource (e.g. the bucket name) for every action you perform.
It depends on individual needs. However, boto3.resource doesn't wrap all of the boto3.client functionality, so sometimes you need to call boto3.client directly, or use boto3.resource.meta.client, to get the job done.
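For example, a resource object exposes the underlying low-level client via its .meta.client attribute, so you can mix the two styles. A minimal sketch, with "mybucket" as a placeholder name:

import boto3

s3 = boto3.resource("s3")
bucket = s3.Bucket("mybucket")  # high-level resource

# drop down to the low-level client for a call you need from the client API
response = s3.meta.client.head_bucket(Bucket=bucket.name)
print(response["ResponseMetadata"]["HTTPStatusCode"])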

If possible, use client over resource, especially when dealing with S3 object lists and then trying to get basic information about those objects.
For 10,000 objects, the client calls S3 10,000/1000 = 10 times and returns a lot of information about each object in each call.
Resource, I assume, calls S3 10,000 times (or maybe the same number as client?), but if you take an object and try to do something with it, that is probably another call to S3, making this about 20x slower than client.
My test reveals the following results.
s3 = boto3.resource("s3")
s3bucket = s3.Bucket(myBucket)  # myBucket = your bucket name
s3obj_list = s3bucket.objects.filter(Prefix=key_prefix)
tmp_list = [s3obj.key for s3obj in s3obj_list]
(tmp_list = [s3obj for s3obj in s3obj_list] gives the same ~9 min result)
Getting a list of 150,000 files this way took ~9 minutes. If s3obj_list is indeed pulling 1000 files per call and buffering them, s3obj.key is probably not part of what is buffered and makes another call.
import boto3

client = boto3.client("s3")
keys = []
kwargs = {"Bucket": bucket, "Prefix": prefix}  # same bucket/prefix as above
while True:
    response = client.list_objects_v2(**kwargs)
    keys.extend(obj["Key"] for obj in response.get("Contents", []))
    if not response.get("IsTruncated"):
        break
    # the loop also sets ContinuationToken for the next page
    kwargs["ContinuationToken"] = response["NextContinuationToken"]
The client took ~30 seconds to list the 150,000 files.
I don't know if resource buffers 1000 files at a time, but if it doesn't, that is a problem.
I also don't know if it is possible for resource to buffer the information attached to each object, but that is another problem.
I also don't know whether using pagination could make the client faster or easier to use.
Anyone who knows the answer to the 3 questions above, please do share. I'd be very interested to know.
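On the pagination question: boto3's client does ship a paginator that handles ContinuationToken for you, so the manual loop above can be collapsed. A minimal sketch, reusing the same bucket and prefix placeholders:

import boto3

client = boto3.client("s3")
paginator = client.get_paginator("list_objects_v2")

keys = []
for page in paginator.paginate(Bucket=bucket, Prefix=key_prefix):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))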

Related

How to mass / bulk PutObjectAcl in Amazon S3?

I am looking for a way to update the ACL of several objects in one (or a few) requests to the AWS API.
My web application contains several sensitive objects stored in AWS S3. These objects have a default ACL of "private". I sometimes need to update several objects' ACL to "public-read" for some time (a couple of minutes) before going back to "private".
For a couple of objects, one request per object to PutObjectAcl is OK. But when dealing with several objects (hundreds), the operation requires too much time.
My question is: how can I "mass put object ACL" or "bulk put object ACL"? The AWS API doesn't seem to offer this, unlike DeleteObjects (which allows deleting several objects at once). But maybe I didn't look in the right place?!
Any tricks or workarounds would be of great value!
Mixing private and public objects inside a bucket is usually a bad idea. If you only need those objects to be public for a couple of minutes, you can create a pre-signed GET URL and set a desired expiration time.
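A minimal sketch of generating such a pre-signed GET URL with boto3; the bucket and key names are placeholders, and the expiry is in seconds:

import boto3

s3_client = boto3.client("s3")

url = s3_client.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "sensitive/object.pdf"},
    ExpiresIn=120,  # the URL stops working after 2 minutes
)
print(url)  # hand this URL to the user instead of making the object public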

Storing S3 Urls vs calling listObjects

I have an app that has an attachments feature for users. They can upload documents to S3 and then revisit to preview and/or download said attachments.
I was planning on storing the S3 URLs in the DB and then pre-signing them when the user needs them. A caveat I'm finding is that this can lead to edge cases between S3 and the DB.
I.e. if a file gets removed from S3 but its URL does not get removed from the DB (or vice versa). This can lead to data inconsistency and may mislead users.
I was thinking of just getting the URLs over the network by using listObjects in the S3 client SDK. I don't really need to store the URLs, and this guarantees the user gets what's actually in S3.
The only con here is that it makes an API request (as opposed to a DB hit).
Any insights?
Thanks!
Using a database to store an index to files is a good idea, especially once the volume of objects increases. The ListObjects() API only returns 1000 objects per call. This might be okay if every user has their own path (so you can use ListObjects(Prefix='user1/')), but that's not ideal if you want to allow document sharing between users.
Using a database will definitely be faster to obtain a listing, and it has the advantage that you can filter on attributes and metadata.
The two systems will only get "out of sync" if objects are created/deleted outside of your app, or if there is an error in the app. If this concerns you, then use Amazon S3 Inventory, to provide a regular listing of objects in the bucket and write some code to compare it against the database entries. This will highlight if anything is going wrong.
While Amazon S3 is an excellent NoSQL database (Key = filename, Value = contents), it isn't good for searching/listing a large quantity of objects.
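If you go the database route, the reconciliation suggested above can start very small. A hedged sketch that compares a live bucket listing against a set of keys pulled from the database; fetch_db_keys and the bucket name are hypothetical placeholders, and for large buckets you would feed it the S3 Inventory report rather than a live listing:

import boto3

def find_mismatches(bucket_name, db_keys):
    """Return keys present in S3 but not in the DB, and vice versa."""
    s3_client = boto3.client("s3")
    paginator = s3_client.get_paginator("list_objects_v2")

    s3_keys = set()
    for page in paginator.paginate(Bucket=bucket_name):
        s3_keys.update(obj["Key"] for obj in page.get("Contents", []))

    return s3_keys - set(db_keys), set(db_keys) - s3_keys

# usage (hypothetical helper):
# only_in_s3, only_in_db = find_mismatches("attachments-bucket", fetch_db_keys())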

Boto3, S3 check if keys exist

Right now I do know how to check if a single key exists within my S3 bucket using Boto 3:
res = s3.list_objects_v2(Bucket=record.bucket_name, Prefix='back.jpg', Delimiter='/')
for obj in res.get('Contents', []):
    print(obj)
However I'm wondering if it's possible to check if multiple keys exist within a single API call. It feels a bit of a waste to do 5+ requests for that.
You could either use head_object() to check whether a specific object exists, or retrieve the complete bucket listing using list_objects_v2() and then look through the returned list to check for multiple objects.
Please note that list_objects_v2() only returns 1000 objects at a time, so it might need several calls to retrieve a list of all objects.
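A small sketch of the first approach, checking several keys with head_object() and treating a 404 as "missing"; the bucket and key names are just placeholders:

import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client("s3")

def key_exists(bucket, key):
    try:
        s3_client.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "404":
            return False
        raise

keys_to_check = ["back.jpg", "front.jpg", "side.jpg"]
missing = [k for k in keys_to_check if not key_exists("my-bucket", k)]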

AWS - want to upload multiple files to S3 and only when all are uploaded trigger a lambda function

I am seeking advice on what's the best way to design this -
Use Case
I want to put multiple files into S3. Once all files are successfully saved, I want to trigger a lambda function to do some other work.
Naive Approach
The way I am approaching this is by saving a record in Dynamo that contains a unique identifier and the total number of records I will be uploading along with the keys that should exist in S3.
A basic implementation would be to take my existing Lambda function, which is invoked any time my S3 bucket is written into, and have it manually check whether all the other files have been saved.
The Lambda function would know what to look for (by checking Dynamo) and query S3 to see if the other files are there. If so, use SNS to trigger my other Lambda that will do the other work.
Edit: Another approach is to have my client program that puts the files in S3 be responsible for directly invoking the other Lambda function, since technically it knows when all the files have been uploaded. The issue with this approach is that I do not want this to be the responsibility of the client program... I want the client program to not care. As soon as it has uploaded the files, it should be able to just exit out.
Thoughts
I don't think this is a good idea, mainly because Lambda functions should be lightweight, and polling the database from within the Lambda function to get the S3 keys of all the uploaded files and then checking in S3 whether they are there - doing this on every invocation seems hacky and very repetitive.
What's the better approach? I was thinking of something like using SWF, but I'm not sure if that's overkill for my solution or if it will even let me do what I want. The documentation doesn't show real "examples" either; it's just a discussion without much of a step-by-step guide (perhaps I'm looking in the wrong spot).
Edit: In response to mbaird's suggestions below -
Option 1 (SNS): This is what I will go with. It's simple and doesn't really violate the Single Responsibility Principle. That is, the client uploads the files and sends a notification (via SNS) that its work is done.
Option 2 (Dynamo streams): This is essentially another "implementation" of Option 1. The client makes a service call, which in this case results in a table update instead of an SNS notification (Option 1). This update would trigger the Lambda function, as opposed to a notification. Not a bad solution, but I prefer using SNS for communication rather than relying on a database's capability (in this case Dynamo streams) to call a Lambda function.
In any case, I'm using AWS technologies and have coupling with their offering (Lambda functions, SNS, etc.) but I feel relying on something like Dynamo streams is making it an even tighter coupling. Not really a huge concern for my use case but still feels dirty ;D
Option 3 (with S3 triggers): My concern here is the possibility of race conditions. For example, if multiple files are being uploaded by the client simultaneously (think of several async uploads fired off at once with varying file sizes), what if two files happen to finish uploading at around the same time, and two or more Lambda functions (or whatever implementation we use) query Dynamo and get back N as the count of completed uploads (instead of N and N+1)? Now even though the final result should be N+2, each one would add 1 to N. Nooooooooooo!
So Option 1 wins.
If you don't want the client program responsible for invoking the Lambda function directly, then would it be OK if it did something a bit more generic?
Option 1: (SNS) What if it simply notified an SNS topic that it had completed a batch of S3 uploads? You could subscribe your Lambda function to that SNS topic.
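For Option 1, the client-side notification can be as small as a single publish call once all uploads return. A hedged sketch; the topic ARN and batch details are placeholders:

import json
import boto3

sns_client = boto3.client("sns")

# after all S3 uploads for this batch have completed successfully
sns_client.publish(
    TopicArn="arn:aws:sns:us-east-1:123456789012:uploads-complete",  # placeholder ARN
    Message=json.dumps({"batchId": "batch-123", "fileCount": 5}),
)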
Option 2: (DynamoDB Streams) What if it simply updated the DynamoDB record with something like an attribute record.allFilesUploaded = true. You could have your Lambda function trigger off the DynamoDB stream. Since you are already creating a DynamoDB record via the client, this seems like a very simple way to mark the batch of uploads as complete without having to code in knowledge about what needs to happen next. The Lambda function could then check the "allFilesUploaded" attribute instead of having to go to S3 for a file listing every time it is called.
Alternatively, don't insert the DynamoDB record until all files have finished uploading, then your Lambda function could just trigger off new records being created.
Option 3: (continuing to use S3 triggers) If the client program can't be changed from how it works today, then instead of listing all the S3 files and comparing them to the list in DynamoDB each time a new file appears, simply update the DynamoDB record via an atomic counter. Then compare the result value against the size of the file list. Once the values are the same you know all the files have been uploaded. The down side to this is that you need to provision enough capacity on your DynamoDB table to handle all the updates, which is going to increase your costs.
Also, I agree with you that SWF is overkill for this task.
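For Option 3, the atomic counter can be implemented with a single UpdateItem call using an ADD expression, so concurrent Lambda invocations can't lose updates. A hedged sketch; the table and attribute names are placeholders:

import boto3

dynamodb = boto3.client("dynamodb")

def record_upload_and_check(batch_id, expected_count):
    """Atomically bump the uploaded-file counter and report whether the batch is complete."""
    response = dynamodb.update_item(
        TableName="upload-batches",  # placeholder table name
        Key={"batchId": {"S": batch_id}},
        UpdateExpression="ADD uploadedCount :one",
        ExpressionAttributeValues={":one": {"N": "1"}},
        ReturnValues="UPDATED_NEW",
    )
    uploaded = int(response["Attributes"]["uploadedCount"]["N"])
    return uploaded == expected_count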

When using AWS SQS, is there any reason to prefer using GetQueueUrl to building a queue url from the region, account id, and name?

I have an application that uses a single SQS queue.
For the sake of flexibility I would like to configure the application using the queue name, SQS region, and AWS account id (as well as the normal AWS credentials and so forth), rather than giving a full queue url.
Does it make any sense to use GetQueueUrl to retrieve a url for the queue when I can just build it with something like the following (in ruby):
region = ENV['SQS_REGION'] # 'us-west-2'
account_id = ENV['SQS_AWS_ACCOUNT_ID'] # '773083218405'
queue_name = ENV['SQS_QUEUE_NAME'] # 'test3'
queue_url = "https://sqs.#{region}.amazonaws.com/#{account_id}/#{queue_name}
# => https://sqs.us-west-2.amazonaws.com/773083218405/test3
Possible reasons that it might not:
Amazon might change their url format.
Others???
I don't think you have any guarantee that the URL will have such a form. The official documentation presents the GetQueueUrl call as the official method for obtaining queue URLs. So while constructing it as above may be a very good guess, it may also fail at any time because Amazon can change the URL scheme (e.g. for new queues).
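For reference, the boto3 equivalent of that lookup is a single GetQueueUrl call. A minimal sketch using the same example values as the question:

import boto3

sqs_client = boto3.client("sqs", region_name="us-west-2")

response = sqs_client.get_queue_url(
    QueueName="test3",
    QueueOwnerAWSAccountId="773083218405",
)
queue_url = response["QueueUrl"]
# e.g. https://sqs.us-west-2.amazonaws.com/773083218405/test3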
If Amazon changes the queue URL format in a breaking way, the change will not be immediate; it will be deprecated slowly and will likely only take effect when you move up an SDK version (i.e. when you upgrade).
While the documentation doesn't guarantee it, Amazon knows that changing the format would be a massively breaking change for thousands of customers.
Furthermore, lots of customers use hard-coded queue URLs which they get from the console, so those customers would not get an updated queue URL format either.
In the end, you will be safe either way. If you have LOTS of queues, you may be better off formatting the URLs yourself. If you have a small number of queues, it shouldn't make much difference either way.
I believe for safety purposes the best way to get the URL is through the sqs.queues.named method. What you can do is memoize the queues by name to avoid multiple calls, something like this:
# https://github.com/phstc/shoryuken/blob/master/lib/shoryuken/client.rb
class Client
  @queues = {}

  class << self
    def queues(queue)
      @queues[queue.to_s] ||= sqs.queues.named(queue)
    end
  end
end