Configure s3 event for alternate PUT operation - amazon-web-services

I have a Lambda function that gets triggered whenever an object is created in s3 bucket.
Now, I need to trigger the Lambda for alternate object creation.
The Lambda should not be triggered the first, third, fifth (and so on) time an object is created, but it should be triggered the second, fourth, sixth (and so on) time.
For this, I created an s3 event for 'PUT' operation.
The first time I used the PUT API. The second time I uploaded the file using -
s3_res.meta.client.upload_file
I thought that this would not trigger the Lambda, since it was an upload and not a PUT. But it triggered the Lambda as well.
Is there any way to achieve this?

meta.client.upload_file triggers your PUT event Lambda because it actually uses PUT.
upload_file uses the s3transfer TransferManager, which issues PUT requests under the hood (you can see this in the code: https://github.com/boto/s3transfer/blob/develop/s3transfer/upload.py). Note that for files above the multipart threshold it performs a multipart upload instead, which S3 reports as s3:ObjectCreated:CompleteMultipartUpload rather than s3:ObjectCreated:Put.
Looking at the AWS SDK, you'll see that POSTing to S3 is pretty much limited to the case where you give a browser/client a pre-signed URL so that it can upload a file (https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPOST.html).
If you want to count the number of times PUT has been called, in order to take action on every even call, the easiest thing to do is to use something like DynamoDB to keep a table of 'file-name' vs 'put-count', which you update on every PUT and act on accordingly.
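A minimal sketch of that counter, assuming a DynamoDB table keyed on a string attribute named file_name (the table name s3-put-counts, the attribute names, and the extra table parameter on the handler are all hypothetical). The ADD update expression increments atomically, so concurrent invocations do not overwrite each other's counts:

```python
import urllib.parse


def record_put_and_check_even(table, object_key):
    """Atomically increment the PUT counter for object_key; True on even counts."""
    response = table.update_item(
        Key={"file_name": object_key},            # hypothetical partition key
        UpdateExpression="ADD put_count :one",    # creates the attribute if missing
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",
    )
    return int(response["Attributes"]["put_count"]) % 2 == 0


def lambda_handler(event, context, table=None):
    if table is None:
        import boto3  # imported lazily so the logic can be exercised without AWS
        table = boto3.resource("dynamodb").Table("s3-put-counts")  # hypothetical name
    processed = []
    for record in event["Records"]:
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if record_put_and_check_even(table, key):
            processed.append(key)  # the real work for every second upload goes here
    return processed
```

The Lambda still runs on every upload; it just returns early on the odd-numbered ones.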
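Alternatively, you could enable bucket versioning and use list_object_versions to see how many times the file has been written. When this answer was written S3 was only eventually consistent for overwrites and listings, so the count could be stale if the file was being updated rapidly. S3 has provided strong read-after-write consistency since December 2020, which removes that particular concern, although concurrent Lambda invocations can still race each other.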
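The versioning approach could look roughly like this, assuming versioning is already enabled on the bucket and that s3_client is a boto3 S3 client (the helper names are made up for illustration):

```python
def count_versions(s3_client, bucket, key):
    """Count how many stored versions exist for exactly this key."""
    paginator = s3_client.get_paginator("list_object_versions")
    total = 0
    for page in paginator.paginate(Bucket=bucket, Prefix=key):
        # Prefix matching also returns keys such as "report.csv.bak",
        # so compare the key exactly.
        total += sum(1 for v in page.get("Versions", []) if v["Key"] == key)
    return total


def is_even_upload(s3_client, bucket, key):
    """True when the object has been written an even number of times."""
    return count_versions(s3_client, bucket, key) % 2 == 0
```

Keep in mind that lifecycle rules which expire old versions would change the count, so this only works if versions are retained.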

Related

How to move file to a sublevel in s3 without triggering lambda?

An excel file (say my_excel_file.xlsx) will be uploaded to s3://my-bucket/a/b/
A trigger is set in lambda with following properties:
bucket-name: my-bucket, prefix: a/b/
I want my lambda to:
Read the excel file uploaded to s3://my-bucket/a/b/ into a pandas dataframe
After processing it, move the excel file to s3://my-bucket/a/b/archive/ with the name: my_excel_file_timestamp.xlsx
In case I am able to achieve the above step, will the lambda get invoked recursively? If yes, is there a workaround?
Since the Amazon S3 event is configured to trigger on the prefix a/b/, it will also trigger the AWS Lambda function when an object is placed into a/b/archive/.
I recommend adding a check at the top of the Lambda function that inspects the Key, which is passed to the function via the event parameter. If the Key starts with a/b/archive/ (or matches a similar rule), the function should exit immediately. This will not incur a significant cost because it exits quickly and Lambda is billed per millisecond.
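As a rough sketch of that guard (the helper name keys_to_process is made up, and the actual pandas processing and archiving are left out):

```python
import urllib.parse

ARCHIVE_PREFIX = "a/b/archive/"


def keys_to_process(event, archive_prefix=ARCHIVE_PREFIX):
    """Return the object keys in the event that are not already archived."""
    keys = []
    for record in event.get("Records", []):
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(archive_prefix):
            continue  # written by this function itself, so skip it
        keys.append(key)
    return keys


def lambda_handler(event, context):
    keys = keys_to_process(event)
    if not keys:
        return {"processed": []}  # exit immediately for archive writes
    # ... read each key into a DataFrame, then copy it under ARCHIVE_PREFIX ...
    return {"processed": keys}
```

With this in place the function is still invoked once for the archived copy, but that invocation returns without doing any work, so there is no recursion.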
The alternative is to put your archive folder in a different location.

Automated Real Time Data Processing on AWS with Lambda

I am interested in doing automated real-time data processing on AWS using Lambda and I am not certain about how I can trigger my Lambda function. My data processing code involves taking multiple files and concatenating them into a single data frame after performing calculations on each file. Since files are uploaded simultaneously onto S3 and files are dependent on each other, I would like the Lambda to be only triggered when all files are uploaded.
Current Approaches/Attempts:
- I am considering an S3 trigger, but my concern is that an S3 trigger will result in an error in the case where a single file upload triggers the Lambda to start. An alternate option would be adding a wait time, but that is not preferred, in order to limit the computation resources used.
- A scheduled trigger using CloudWatch/EventBridge, but this would not be real-time processing.
- An SNS trigger, but I am not certain if the message can be automated without knowing when the file uploads have completed.
Any suggestion is appreciated! Thank you!
If you really cannot do it with a scheduled function, the best option is to trigger a Lambda function when an object is created.
The tricky bit is that it will fire your function on each object upload. So you either can identify the "last part", e.g., based on some meta data, or you will need to store and track the state of all uploads, e.g. in a DynamoDB, and do the actual processing only when a batch is complete.
Best, Stefan
Your file coming in parts might be named as -
filename_part1.ext
filename_part2.ext
If one of your own systems is generating those files, then have that system write a final, empty marker file named -
filename.final
Since the S3 event trigger lets you filter on a key suffix, use the .final extension to invoke the Lambda, and process the records then.
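For illustration, the suffix filter could be set up with boto3 roughly as follows. The helper names are hypothetical, and the sketch assumes the Lambda already has a resource policy that allows S3 to invoke it:

```python
def build_final_marker_notification(lambda_arn, suffix=".final"):
    """Notification configuration that only fires for keys ending in suffix."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": suffix}]}
                },
            }
        ]
    }


def apply_notification(s3_client, bucket, lambda_arn):
    # This call replaces the bucket's whole notification configuration,
    # so merge in any existing rules first if the bucket already has some.
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_final_marker_notification(lambda_arn),
    )
```

The part files (filename_part1.ext and so on) then never invoke the function at all; only the marker does.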
As an alternative, if you do not have access to the server that puts objects into your S3 bucket, invoke the Lambda on every PUT to the bucket and have it insert an entry into DynamoDB.
You need one unique entry per file (not per file part) in DynamoDB, with -
filename and last_part_recieved_time
The last_part_recieved_time keeps getting updated for as long as file parts keep arriving.
This table can then be looked up by a cron-scheduled Lambda invocation, which checks whether the time skew (the difference between the system time at invocation and the last_part_recieved_time stored in DynamoDB) is large enough to assume the file is complete and process the records.
I would still prefer the first approach, as the second one still has a chance of error.
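The time-skew check of the second approach might be sketched like this, assuming each DynamoDB item stores last_part_recieved_time as an ISO 8601 string with a UTC offset, and assuming a five-minute quiet period (both are illustrative choices, not something the approach prescribes):

```python
from datetime import datetime, timedelta, timezone

QUIET_PERIOD = timedelta(minutes=5)  # assumed threshold; tune for your uploads


def files_ready(items, now, quiet_period=QUIET_PERIOD):
    """Return filenames whose last part arrived at least quiet_period ago."""
    ready = []
    for item in items:
        last_seen = datetime.fromisoformat(item["last_part_recieved_time"])
        if now - last_seen >= quiet_period:
            ready.append(item["filename"])
    return ready


def cron_handler(event, context, items=()):
    # In a real function, items would come from a scan or query of the table.
    return files_ready(items, datetime.now(timezone.utc))
```

The chance of error mentioned above is visible here: a part that is delayed by more than the quiet period will be missed, because the file has already been treated as complete.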
Since you want this to be as real time as possible, perhaps you could just perform your logic every single time a file is uploaded, updating the version of the output as new files are added, and iterating through an S3 prefix per grouping of files, like in this other SO answer.
In terms of the architecture, you could add in an SQS queue or two to make this more resilient. An S3 Put Event can trigger an SQS message, which can trigger a Lambda function, and you can have error handling logic in the Lambda function that puts that event in a secondary queue with a visibility timeout (sort of like a backoff strategy) or back in the same queue for retries.

AWS Lambda function getting called repeatedly

I have written a Lambda function which gets invoked automatically when a file comes into my S3 bucket.
I perform certain validations on this file, modify its contents, and put the file back at the same location.
Due to this "put", my lambda is called again and the process goes on till my lambda execution times out.
Is there any way to trigger this lambda only once?
I found an approach where I can store the file name in DynamoDB and can apply a check in lambda function, but can there be any other approach where DynamoDB's use can be avoided?
You have a few options:
You can put the file to a different location in S3 and delete the original.
You can add a metadata field to the S3 object when you update it, then check for the presence of that field so you know whether you have already processed it. When this was written, this might not have worked perfectly, because S3 did not always return the most recent data on reads after updates; since December 2020 S3 reads are strongly consistent.
AWS allows different types of S3 event triggers. You can try playing with s3:ObjectCreated:Put vs s3:ObjectCreated:Post.
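A sketch of the metadata option, assuming a user-defined metadata key named processed (the key name and helper names are hypothetical) and a boto3 S3 client passed in as s3_client:

```python
PROCESSED_FLAG = "processed"  # hypothetical user-defined metadata key


def already_processed(s3_client, bucket, key):
    """True if the object carries the marker written by write_back."""
    head = s3_client.head_object(Bucket=bucket, Key=key)
    return head.get("Metadata", {}).get(PROCESSED_FLAG) == "true"


def write_back(s3_client, bucket, key, body):
    """Overwrite the object in place and tag it so the next trigger is a no-op."""
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        Metadata={PROCESSED_FLAG: "true"},
    )
```

The handler would call already_processed first and return early when it is true. The function is still invoked a second time by its own write, but that second invocation stops the loop.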
You can upload your files in a folder, say
s3://bucket-name/notvalidated
and store the validated in another folder, say
s3://bucket-name/validated.
Update your S3 event notification to invoke your Lambda function whenever there is an ObjectCreated (All) event under the notvalidated/ prefix.
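With that layout, the write-back step could look something like the sketch below. The helper names are invented for illustration, and s3_client is assumed to be a boto3 S3 client:

```python
def validated_key(key, source_prefix="notvalidated/", dest_prefix="validated/"):
    """Map notvalidated/<name> to validated/<name>."""
    if not key.startswith(source_prefix):
        raise ValueError(f"unexpected key outside {source_prefix}: {key}")
    return dest_prefix + key[len(source_prefix):]


def store_validated(s3_client, bucket, key, body):
    """Write the validated content under the other prefix and drop the original."""
    dest = validated_key(key)
    s3_client.put_object(Bucket=bucket, Key=dest, Body=body)
    s3_client.delete_object(Bucket=bucket, Key=key)
    return dest
```

Because the notification only covers the notvalidated/ prefix, the put_object call under validated/ does not invoke the function again.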
The second answer does not seem to be correct (put vs post) - there is not really a concept of update in S3 in terms of POST or PUT. The request to update an object will be the same as the initial POST of the object. See here for details on the available S3 events.
I had this exact problem last year - I was doing an image resize on PUT, and every time a file was overwritten it would be triggered again. My recommended solution would be to have two folders in your S3 bucket - one for the original file and one for the finalized file. You could then create the Lambda trigger with a prefix filter so that it only fires for files in the original folder.
S3 raises events when an object is created via Put, Post, Copy, or CompleteMultipartUpload. All of these operations correspond to ObjectCreated events, as per the AWS documentation.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
The best solution is to restrict your S3 object-created event to a particular bucket location (prefix), so that only changes in that location trigger the Lambda function.
You can then write the modified file to some other bucket location that is not configured to trigger the Lambda function when an object is created there.
Hope it helps!

Iterating a list of lists within Python for aws lambda

I have a question related to python.
The use case is to write a python function (in aws lambda) which will look for a set of files in multiple buckets and return some action like creating a dummy file in s3 bucket or triggering another lambda.
For eg:
list1 = ["file1", "file2", "file3"]
list2 = ["file4", "file5", "file6"]
list3 = ["f7", "f8", "f9"]

def lambda_handler(event, context):
    if len(list1) == 9:
        print("something")
        # create dummy file in s3 OR trigger another lambda
    elif len(list2) == 9:
        print("Something")
    else:
        print("all files are not available")
and like wise.
I am a bit confused about how to iterate over the 3 lists and trigger one Lambda for one set of files, say list1 or list2 or list3. Alternatively, I can create a dummy file in S3.
Can anyone please help me with the way to do it?
I would recommend this architecture:
Create an AWS Lambda function that is triggered whenever any file is added to the S3 bucket
The Lambda function will receive details of the file that was added (which caused the function to be triggered)
The function can then check whether all the associated files are also present
If they are not present, it simply exits and does nothing
If they are present, it can then do the desired processing or invoke another Lambda function to do the processing
This way, things only happen when files arrive, rather than having to check every n minutes. Also, it is only triggered by new files arriving, so it does not have to skip over existing files that have already been processed or are awaiting other files.
The only potential danger is if all the desired files arrive in a short time space. Each file would trigger a separate lambda function and each of them might see that all files are available and then attempt to trigger the next process. So, be a little careful around that second trigger. You might need to include some logic to make sure they aren't processed twice.
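One way to sketch both the presence check and the "don't process twice" logic. The helper names and the batch_id attribute are hypothetical; s3_client is assumed to be a boto3 S3 client, and table a boto3 DynamoDB Table whose partition key is batch_id:

```python
def missing_keys(s3_client, bucket, expected_keys, prefix=""):
    """Return the expected keys that are not yet present in the bucket."""
    present = set()
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        present.update(obj["Key"] for obj in page.get("Contents", []))
    return sorted(set(expected_keys) - present)


def claim_batch(table, batch_id):
    """Return True for exactly one caller per batch_id (conditional write)."""
    try:
        table.put_item(
            Item={"batch_id": batch_id},
            ConditionExpression="attribute_not_exists(batch_id)",
        )
        return True
    except Exception as exc:
        # boto3 raises botocore.exceptions.ClientError carrying this error code
        # when the item already exists.
        error = getattr(exc, "response", {}).get("Error", {})
        if error.get("Code") == "ConditionalCheckFailedException":
            return False
        raise
```

A handler would proceed only when missing_keys returns an empty list and claim_batch returns True, so if several invocations all see the full set of files, only one of them starts the next step.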

AWS - want to upload multiple files to S3 and only when all are uploaded trigger a lambda function

I am seeking advice on what's the best way to design this -
Use Case
I want to put multiple files into S3. Once all files are successfully saved, I want to trigger a lambda function to do some other work.
Naive Approach
The way I am approaching this is by saving a record in Dynamo that contains a unique identifier and the total number of records I will be uploading along with the keys that should exist in S3.
A basic implementation would be to take my existing lambda function, which is invoked any time my S3 bucket is written to, and have it check manually whether all the other files have been saved.
The Lambda function would know (look in Dynamo to determine what we're looking for) and query S3 to see if the other files are in. If so, use SNS to trigger my other lambda that will do the other work.
Edit: Another approach is have my client program that puts the files in S3 be responsible for directly invoking the other lambda function, since technically it knows when all the files have been uploaded. The issue with this approach is that I do not want this to be the responsibility of the client program... I want the client program to not care. As soon as it has uploaded the files, it should be able to just exit out.
Thoughts
I don't think this is a good idea. Mainly because Lambda functions should be lightweight, and polling the database from within the Lambda function to get the S3 keys of all the uploaded files and then checking in S3 if they are there - doing this each time seems hacky and very repetitive.
What's the better approach? I was thinking something like using SWF but am not sure if that's overkill for my solution or if it will even let me do what I want. The documentation doesn't show real "examples" either. It's just a discussion without much of a step by step guide (perhaps I'm looking in the wrong spot).
Edit In response to mbaird's suggestions below-
Option 1 (SNS) This is what I will go with. It's simple and doesn't really violate the Single Responsibility Principle. That is, the client uploads the files and sends a notification (via SNS) that its work is done.
Option 2 (Dynamo streams) So this is essentially another "implementation" of Option 1. The client makes a service call, which in this case, results in a table update vs. a SNS notification (Option 1). This update would trigger the Lambda function, as opposed to notification. Not a bad solution, but I prefer using SNS for communication rather than relying on a database's capability (in this case Dynamo streams) to call a Lambda function.
In any case, I'm using AWS technologies and have coupling with their offering (Lambda functions, SNS, etc.) but I feel relying on something like Dynamo streams is making it an even tighter coupling. Not really a huge concern for my use case but still feels dirty ;D
Option 3 with S3 triggers My concern here is the possibility of race conditions. For example, if multiple files are being uploaded by the client simultaneously (think of several async uploads fired off at once with varying file sizes), what if two files happen to finish uploading at around the same time, and two or more Lambda functions (or whatever implementations we use) query Dynamo and each get back N as the number of completed uploads (instead of N and N+1)? Now even though the final result should be N+2, each one would add 1 to N. Nooooooooooo!
So Option 1 wins.
If you don't want the client program responsible for invoking the Lambda function directly, then would it be OK if it did something a bit more generic?
Option 1: (SNS) What if it simply notified an SNS topic that it had completed a batch of S3 uploads? You could subscribe your Lambda function to that SNS topic.
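A hedged sketch of Option 1. The message layout (batch_id, keys) and the function names are invented for illustration; sns_client is assumed to be a boto3 SNS client:

```python
import json


def build_batch_message(batch_id, keys):
    """Serialize the batch description the client publishes when it is done."""
    return json.dumps({"batch_id": batch_id, "keys": sorted(keys)})


def notify_batch_complete(sns_client, topic_arn, batch_id, keys):
    """Called by the client after its last upload has succeeded."""
    return sns_client.publish(
        TopicArn=topic_arn,
        Subject="s3-batch-complete",
        Message=build_batch_message(batch_id, keys),
    )


def lambda_handler(event, context):
    """Subscribed to the topic; receives one record per published message."""
    batches = [json.loads(record["Sns"]["Message"]) for record in event["Records"]]
    # ... process each batch["keys"] here ...
    return batches
```

The client only knows about the topic, not about what happens next, which keeps it decoupled from the Lambda.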
Option 2: (DynamoDB Streams) What if it simply updated the DynamoDB record with something like an attribute record.allFilesUploaded = true. You could have your Lambda function trigger off the DynamoDB stream. Since you are already creating a DynamoDB record via the client, this seems like a very simple way to mark the batch of uploads as complete without having to code in knowledge about what needs to happen next. The Lambda function could then check the "allFilesUploaded" attribute instead of having to go to S3 for a file listing every time it is called.
Alternatively, don't insert the DynamoDB record until all files have finished uploading, then your Lambda function could just trigger off new records being created.
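For Option 2, the stream-triggered function might filter records along these lines. This assumes the stream is configured to include new images and that the item has a string attribute batch_id, which is a hypothetical name:

```python
def completed_batches(event):
    """Return batch ids whose item now has allFilesUploaded set to true."""
    done = []
    for record in event.get("Records", []):
        if record.get("eventName") not in ("INSERT", "MODIFY"):
            continue  # ignore REMOVE events
        # Stream images use DynamoDB's typed JSON, e.g. {"BOOL": True}.
        image = record["dynamodb"].get("NewImage", {})
        if image.get("allFilesUploaded", {}).get("BOOL") is True:
            done.append(image["batch_id"]["S"])
    return done


def lambda_handler(event, context):
    batches = completed_batches(event)
    # ... kick off the follow-up work for each completed batch ...
    return batches
```

The same filter covers the "insert only when finished" variant, provided the inserted record carries the flag.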
Option 3: (continuing to use S3 triggers) If the client program can't be changed from how it works today, then instead of listing all the S3 files and comparing them to the list in DynamoDB each time a new file appears, simply update the DynamoDB record via an atomic counter. Then compare the resulting value against the size of the file list. Once the values are the same you know all the files have been uploaded. The downside to this is that you need to provision enough capacity on your DynamoDB table to handle all the updates, which is going to increase your costs.
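A minimal sketch of the atomic counter, assuming the client's record has a list attribute expected_keys and that the counter lives in uploaded_count on the same item (both names, and batch_id as the partition key, are assumptions):

```python
def record_upload(table, batch_id):
    """Increment the batch counter; True only for the call that completes it."""
    response = table.update_item(
        Key={"batch_id": batch_id},
        UpdateExpression="ADD uploaded_count :one",  # atomic server-side increment
        ExpressionAttributeValues={":one": 1},
        ReturnValues="ALL_NEW",
    )
    item = response["Attributes"]
    return int(item["uploaded_count"]) == len(item["expected_keys"])
```

Because the increment happens inside DynamoDB, two invocations finishing at the same time receive N+1 and N+2 rather than both reading N, so exactly one of them sees the completed count. One caveat: S3 can occasionally deliver the same notification more than once, which would overcount, so tracking the set of seen keys instead of a bare number may be safer.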
Also, I agree with you that SWF is overkill for this task.