I have two API calls I want to make to AWS:
put an item into S3
write a row to DynamoDB
I'd like either both to happen, or if there's an error, neither to happen.
Is it possible to achieve that using boto3?
This isn't possible to do automatically. Boto3 has no facility to flag multiple actions as atomic. You will need to write code that checks the response code and catches exceptions from both actions, and then skips or rolls back the other action.
For example if you already successfully PUT an object to S3, but the DynamoDB insert fails, you would have to capture that failure, and then run an S3 delete operation.
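A minimal sketch of that compensating-action pattern follows. It assumes `s3` and `dynamodb` are boto3 clients (for example `boto3.client("s3")` and `boto3.client("dynamodb")`); the function name and all argument names are hypothetical. This is best-effort rather than truly atomic: if the delete itself fails, the object stays in S3.

```python
def put_object_and_item(s3, dynamodb, bucket, key, body, table_name, item):
    """Write to S3, then DynamoDB; delete the S3 object if the DynamoDB write fails.

    `s3` and `dynamodb` are expected to be boto3 clients. Every other name
    here is a placeholder chosen for illustration.
    """
    # Step 1: if this raises, nothing has been written yet.
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    try:
        # Step 2: `item` uses the low-level attribute format,
        # e.g. {"pk": {"S": "abc"}}.
        dynamodb.put_item(TableName=table_name, Item=item)
    except Exception:
        # Compensating action: undo step 1, then surface the original error.
        s3.delete_object(Bucket=bucket, Key=key)
        raise
```

Note that if the key already existed before the call, the rollback deletes it rather than restoring the previous content, so unique keys (or a versioned bucket) make this safer.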
I'm working on preparing a disaster recovery plan for DynamoDB. In a DR situation we would create a temporary table to restore a snapshot to. From the temp table we would copy data to a table that has been provisioned with IaC. Our DynamoDB tables have a stream and an associated Lambda trigger, which would process all the events generated by copying in the data from the temp table. This is unwanted and would cause a bunch of downstream issues.
Ideally I would like to disable the stream/Lambda trigger until the restore is complete, then re-enable it and ignore any of the changes from the copy/restore process.
I've read through the DynamoDB Streams documentation and it isn't clear to me whether disabling the stream will clear events. My understanding is that disabling and re-enabling the DynamoDB stream gives you a new ARN but is still the same stream behind the scenes, that the log lives for 24 hours, and that once the stream is enabled again the events would be sent to the Lambda trigger.
Seems I might be able to configure the trigger on the lambda side, disable it, and then set the ShardIteratorType to 'LATEST' in order to prevent reading events from copied data.
Thanks in advance for any advice.
For anyone traveling down this path in the future, I received some good answers on AWS re:Post here.
Some takeaways:
When you disable/enable a stream, the ARN changes and the new stream is completely separate from the old one.
If the stream is disabled while you copy data to the table, the stream log will still be empty once you re-enable it.
Lambda event source mapping (ESM) solution: LATEST starts reading from the point at which you enable the trigger on the ESM. So if your copy puts events on the stream and you later create an ESM with the iterator position set to LATEST, your Lambda will disregard all the data from the copy.
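The two approaches in these takeaways could be sketched as below. This assumes `dynamodb` and `lambda_client` are boto3 clients (`boto3.client("dynamodb")` and `boto3.client("lambda")`); the function names, table name, and ARN arguments are hypothetical.

```python
def disable_stream(dynamodb, table_name):
    """Turn the table's stream off before the restore/copy starts."""
    dynamodb.update_table(
        TableName=table_name,
        StreamSpecification={"StreamEnabled": False},
    )


def attach_trigger_from_latest(lambda_client, stream_arn, function_name):
    """Create an event source mapping that only reads records written
    after the mapping starts, so earlier copy events are skipped."""
    response = lambda_client.create_event_source_mapping(
        EventSourceArn=stream_arn,
        FunctionName=function_name,
        StartingPosition="LATEST",
        Enabled=True,
    )
    # The UUID identifies the mapping for later update or delete calls.
    return response["UUID"]
```

Either one alone should be enough per the answers above; which is simpler depends on whether the stream or the trigger is easier to manage outside of IaC.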
I am trying to build a Lambda function that gets triggered on S3 delete events. If multiple items are deleted at once, I want to use an S3 batch job. What I can't figure out or find in the documentation is what an event like that would look like. I'd assume it would just have multiple similar items in Records and I could iterate through, get all the keys, and then batch delete, but I can't confirm that. I've searched the documentation, and I built a test Lambda that would just log the event, but that came through as multiple distinct events. I'm stumped as to how to do what I'm trying here.
The S3 event you need to subscribe to is s3:ObjectRemoved:Delete which, according to the documentation, is used to track an object or a batch of objects being removed:
By using the ObjectRemoved event types, you can enable notification when an object or a batch of objects is removed from a bucket.
You can expect an event structured as detailed here.
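As a rough illustration, a handler that does not care whether deletions arrive as one event with several records or as several single-record events might look like this. The helper name `deleted_keys` is hypothetical; the `Records`, `eventName`, and `s3.bucket.name` / `s3.object.key` fields follow the S3 notification message structure, where object keys arrive URL-encoded.

```python
from urllib.parse import unquote_plus


def deleted_keys(event):
    """Return (bucket, key) pairs for every ObjectRemoved record in the event."""
    pairs = []
    for record in event.get("Records", []):
        # Skip anything that is not a removal notification.
        if not record.get("eventName", "").startswith("ObjectRemoved:"):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Keys are URL-encoded in the notification payload.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def lambda_handler(event, context):
    for bucket, key in deleted_keys(event):
        print(f"deleted: s3://{bucket}/{key}")
```

Because the loop handles any number of records, it also works when S3 delivers each deletion as its own invocation, as observed in the question.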
However, since you said in the comments that you just want to "copy the objects pre-delete to another bucket", you may want to explore S3 bucket versioning.
Enabling versioning preserves deleted objects in a "deleted" state, leaving room for future restores, as per the delete workflow here.
I'm trying to use put_object_lock_configuration() API call to disable object locking on an Amazon S3 bucket using python boto3.
This is how I use it:
response = s3.put_object_lock_configuration(
    Bucket=bucket_name,
    ObjectLockConfiguration={'ObjectLockEnabled': 'Disabled'},
)
I always get an exception with the following error.
botocore.exceptions.ClientError: An error occurred (MalformedXML) when calling the PutObjectLockConfiguration operation: The XML you provided was not well-formed or did not validate against our published schema
I suspect I'm missing the two parameters 'Token' and 'ContentMD5'. Does anyone know how I can get these values?
The only allowed value of 'ObjectLockEnabled' is 'Enabled'. My intention was to disable Object Lock, but this is not possible, because Object Lock is defined at bucket creation time and can't be changed afterward. However, I can provide an empty rule and the retention mode will become 'None', which is essentially no object lock.
Here is the boto3 code for a blank retention rule. The precondition is that the object was locked with mode=GOVERNANCE in the first place.
client.put_object_retention(
Bucket=bucket_name, Key=object_key,
Retention={},
BypassGovernanceRetention=True
)
I have a Lambda function that gets triggered whenever an object is created in s3 bucket.
Now I need to trigger the Lambda only on alternate object creations.
The Lambda should not be triggered when the object is created for the first, third, fifth (and so on) time, but it should be triggered for the second, fourth, sixth (and so on) time.
For this, I created an s3 event for 'PUT' operation.
The first time I used the PUT API. The second time I uploaded the file using -
s3_res.meta.client.upload_file
I thought that it would not trigger lambda since this was upload and not PUT. But this also triggered the Lambda.
Is there any way to do this?
The reason that meta.client.upload_file triggers your PUT event Lambda is that it actually uses PUT.
upload_file (docs) uses the TransferManager client, which uses PUT under-the-hood (you can see this in the code: https://github.com/boto/s3transfer/blob/develop/s3transfer/upload.py)
Looking at the AWS-SDK you'll see that POST'ing to S3 is pretty much limited to when you want to give a browser/client a pre-signed URL for them to upload a file to. (https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPOST.html)
If you want to count the number of times PUT has been called, in order to take action on every even call, then the easiest thing to do is to use something like DynamoDB to create a table of 'file-name' vs 'put-count', which you update with every PUT and act on accordingly.
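A sketch of that counter, assuming `dynamodb` is a boto3 DynamoDB client and a table whose partition key is a string attribute named 'file-name'. The function name and table name are hypothetical. The hyphen in 'put-count' is why the update expression goes through an attribute-name placeholder.

```python
def record_put_and_check_even(dynamodb, table_name, file_name):
    """Atomically increment the put count for a file; return True on even counts."""
    response = dynamodb.update_item(
        TableName=table_name,
        Key={"file-name": {"S": file_name}},
        # ADD creates the attribute at 0 if it is missing, then increments it.
        UpdateExpression="ADD #c :one",
        ExpressionAttributeNames={"#c": "put-count"},
        ExpressionAttributeValues={":one": {"N": "1"}},
        ReturnValues="UPDATED_NEW",
    )
    count = int(response["Attributes"]["put-count"]["N"])
    return count % 2 == 0
```

The Lambda would still be invoked on every PUT; it would call this first and return early when the result is False.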
Alternatively you could enable bucket file versioning. Then you could use list_object_versions to see how many times the file has been updated. Although you should be aware that S3 is eventually consistent, so this may not be accurate if the file is being rapidly updated.
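The versioning alternative could be sketched like this, assuming `s3` is a boto3 S3 client and the bucket already has versioning enabled. The function name is hypothetical. Since `Prefix` matches any key that starts with the given string, the sketch filters on an exact key match.

```python
def count_versions(s3, bucket, key):
    """Count how many stored versions exist for exactly this key."""
    paginator = s3.get_paginator("list_object_versions")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=key):
        # "Versions" is absent from a page that has no matching versions.
        total += sum(1 for v in page.get("Versions", []) if v["Key"] == key)
    return total
```

An even result would then correspond to the second, fourth, sixth (and so on) upload, provided no versions have been deleted or expired by a lifecycle rule.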
I am seeking advice on what's the best way to design this -
Use Case
I want to put multiple files into S3. Once all files are successfully saved, I want to trigger a lambda function to do some other work.
Naive Approach
The way I am approaching this is by saving a record in Dynamo that contains a unique identifier and the total number of records I will be uploading along with the keys that should exist in S3.
A basic implementation would be to take my existing lambda function, which is invoked anytime my S3 bucket is written into, and have it check manually whether all the other files have been saved.
The Lambda function would know (look in Dynamo to determine what we're looking for) and query S3 to see if the other files are in. If so, use SNS to trigger my other lambda that will do the other work.
Edit: Another approach is have my client program that puts the files in S3 be responsible for directly invoking the other lambda function, since technically it knows when all the files have been uploaded. The issue with this approach is that I do not want this to be the responsibility of the client program... I want the client program to not care. As soon as it has uploaded the files, it should be able to just exit out.
Thoughts
I don't think this is a good idea, mainly because Lambda functions should be lightweight. Polling the database from within the Lambda function to get the S3 keys of all the uploaded files, and then checking in S3 whether they are there, each time the function runs seems hacky and very repetitive.
What's the better approach? I was thinking something like using SWF but am not sure if that's overkill for my solution or if it will even let me do what I want. The documentation doesn't show real "examples" either. It's just a discussion without much of a step by step guide (perhaps I'm looking in the wrong spot).
Edit In response to mbaird's suggestions below-
Option 1 (SNS) This is what I will go with. It's simple and doesn't really violate the Single Responsibility Principle. That is, the client uploads the files and sends a notification (via SNS) that its work is done.
Option 2 (Dynamo streams) So this is essentially another "implementation" of Option 1. The client makes a service call, which in this case results in a table update rather than an SNS notification (Option 1). This update would trigger the Lambda function, as opposed to a notification. Not a bad solution, but I prefer using SNS for communication rather than relying on a database's capability (in this case Dynamo streams) to call a Lambda function.
In any case, I'm using AWS technologies and have coupling with their offering (Lambda functions, SNS, etc.) but I feel relying on something like Dynamo streams is making it an even tighter coupling. Not really a huge concern for my use case but still feels dirty ;D
Option 3 with S3 triggers My concern here is the possibility of race conditions. For example, if multiple files are being uploaded by the client simultaneously (think of several async uploads fired off at once with varying file sizes), what if two files happen to finish uploading at around the same time, and two or more Lambda functions (or whatever implementations we use) query Dynamo and both get back N as the completed uploads (instead of N and N+1)? Now even though the final result should be N+2, each one would add 1 to N. Nooooooooooo!
So Option 1 wins.
If you don't want the client program responsible for invoking the Lambda function directly, then would it be OK if it did something a bit more generic?
Option 1: (SNS) What if it simply notified an SNS topic that it had completed a batch of S3 uploads? You could subscribe your Lambda function to that SNS topic.
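For Option 1, the client-side call could be as small as the sketch below. It assumes `sns` is a boto3 SNS client (`boto3.client("sns")`); the function name, the topic ARN, and the message fields `batchId` and `keys` are hypothetical.

```python
import json


def notify_batch_complete(sns, topic_arn, batch_id, keys):
    """Publish a single 'all uploads finished' message for a batch."""
    message = json.dumps({"batchId": batch_id, "keys": keys})
    response = sns.publish(TopicArn=topic_arn, Message=message)
    return response["MessageId"]
```

The client calls this once after its last upload succeeds and can then exit; the subscribed Lambda receives the message and does the follow-up work.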
Option 2: (DynamoDB Streams) What if it simply updated the DynamoDB record with something like an attribute record.allFilesUploaded = true. You could have your Lambda function trigger off the DynamoDB stream. Since you are already creating a DynamoDB record via the client, this seems like a very simple way to mark the batch of uploads as complete without having to code in knowledge about what needs to happen next. The Lambda function could then check the "allFilesUploaded" attribute instead of having to go to S3 for a file listing every time it is called.
Alternatively, don't insert the DynamoDB record until all files have finished uploading, then your Lambda function could just trigger off new records being created.
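For Option 2, the stream-triggered Lambda might filter records roughly as follows. It assumes the stream's view type includes new images (NEW_IMAGE or NEW_AND_OLD_IMAGES) and that each record carries a string attribute `batchId`; that attribute and the function names are hypothetical, while `allFilesUploaded` is the flag described above.

```python
def completed_batches(event):
    """Return the batch ids whose new image has allFilesUploaded set to true."""
    done = []
    for record in event.get("Records", []):
        # Only inserts and updates carry a new image worth inspecting.
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"].get("NewImage", {})
        if image.get("allFilesUploaded", {}).get("BOOL") is True:
            done.append(image["batchId"]["S"])
    return done


def lambda_handler(event, context):
    for batch_id in completed_batches(event):
        print(f"batch ready: {batch_id}")
```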
Option 3: (continuing to use S3 triggers) If the client program can't be changed from how it works today, then instead of listing all the S3 files and comparing them to the list in DynamoDB each time a new file appears, simply update the DynamoDB record via an atomic counter. Then compare the result value against the size of the file list. Once the values are the same you know all the files have been uploaded. The downside to this is that you need to provision enough capacity on your DynamoDB table to handle all the updates, which is going to increase your costs.
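A sketch of the atomic counter in Option 3, assuming `dynamodb` is a boto3 DynamoDB client and that the batch record has a string key `batchId` plus a numeric `totalFiles` written by the client up front. Those attribute names, `uploadedCount`, and the function name are all hypothetical. Because the increment happens inside DynamoDB, two concurrent invocations cannot both read N and write N+1.

```python
def mark_file_uploaded(dynamodb, table_name, batch_id):
    """Atomically bump the uploaded count; return True once it reaches the total."""
    response = dynamodb.update_item(
        TableName=table_name,
        Key={"batchId": {"S": batch_id}},
        # ADD is applied server-side, so concurrent callers each get a distinct value.
        UpdateExpression="ADD uploadedCount :one",
        ExpressionAttributeValues={":one": {"N": "1"}},
        ReturnValues="ALL_NEW",
    )
    attrs = response["Attributes"]
    return int(attrs["uploadedCount"]["N"]) == int(attrs["totalFiles"]["N"])
```

Only the invocation that sees the count reach `totalFiles` gets True and triggers the follow-up work. One caveat: S3 notifications can occasionally be delivered more than once, so a production version may want to guard against counting the same key twice.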
Also, I agree with you that SWF is overkill for this task.