AWS Lambda function getting called repeatedly - amazon-web-services

I have written a Lambda function which gets invoked automatically when a file comes into my S3 bucket.
I perform certain validations on this file, modify it, and put the file back at the same location.
Due to this "put", my lambda is called again and the process goes on till my lambda execution times out.
Is there any way to trigger this lambda only once?
I found an approach where I can store the file name in DynamoDB and apply a check in the Lambda function, but is there any other approach where the use of DynamoDB can be avoided?

You have a couple options:
You can put the file to a different location in s3 and delete the original
You can add a metadata field to the S3 object when you update it, then check for the presence of that field so you know whether you have already processed it. This might not work perfectly, since S3 does not always provide the most recent data on reads after updates.
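A minimal sketch of that check, assuming the function marks the object with a user metadata flag when it writes it back (the "processed" key name, bucket, and validation step are placeholders):

```python
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # head_object returns the user metadata without downloading the body
        head = s3.head_object(Bucket=bucket, Key=key)
        if head["Metadata"].get("processed") == "true":
            # Already validated on a previous invocation; skip to break the loop
            continue

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        validated = body  # ... perform validations / modifications here ...

        # Re-upload with the marker so the next invocation exits early
        s3.put_object(
            Bucket=bucket,
            Key=key,
            Body=validated,
            Metadata={"processed": "true"},
        )
```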

AWS allows different types of S3 event triggers. You can try playing with s3:ObjectCreated:Put vs s3:ObjectCreated:Post.

You can upload your files in a folder, say
s3://bucket-name/notvalidated
and store the validated files in another folder, say
s3://bucket-name/validated.
Update your S3 event notification to invoke your Lambda function whenever there is an ObjectCreated (All) event on the notvalidated/ prefix.
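As a rough illustration (the bucket name and function ARN below are placeholders), that prefix-filtered notification could be configured with boto3 like this:

```python
import boto3

s3 = boto3.client("s3")

# Placeholder values -- substitute your own bucket and Lambda ARN
BUCKET = "bucket-name"
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:validate-upload"

# Note: S3 must also be allowed to invoke the function (lambda add-permission)
s3.put_bucket_notification_configuration(
    Bucket=BUCKET,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": LAMBDA_ARN,
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            # Only objects under notvalidated/ invoke the function,
                            # so writes to validated/ do not re-trigger it
                            {"Name": "prefix", "Value": "notvalidated/"}
                        ]
                    }
                },
            }
        ]
    },
)
```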

The second answer does not seem to be correct (put vs post): there is not really a concept of an "update" in S3 in terms of POST or PUT. The request that overwrites an object is the same kind of request as the one that initially created it. See here for details on the available S3 events.
I had this exact problem last year: I was doing an image resize on PUT, and every time a file was overwritten the function would be triggered again. My recommended solution would be to have two folders in your S3 bucket, one for the original file and one for the finalized file. You could then create the Lambda trigger with a prefix filter so it only fires for files in the original folder.

Events are triggered in S3 based on whether the object was created via Put, Post, Copy, or CompleteMultipartUpload; all of these operations correspond to ObjectCreated as per the AWS documentation:
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
The best solution is to restrict your S3 object-created event to a particular bucket location (prefix), so that only changes in that location trigger the Lambda function.
You can then write the modified object to some other bucket location that is not configured to trigger the Lambda function when an object is created there.
Hope it helps!

Related

How to move file to a sublevel in s3 without triggering lambda?

An excel file (say my_excel_file.xlsx) will be uploaded to s3://my-bucket/a/b/
A trigger is set on the Lambda with the following properties:
bucket-name: my-bucket, prefix: a/b/
I want my lambda to:
Read the excel file uploaded to s3://my-bucket/a/b/ into a pandas dataframe
After processing it, move the excel file to s3://my-bucket/a/b/archive/ with the name: my_excel_file_timestamp.xlsx
If I am able to achieve the above step, will the Lambda get invoked recursively? If yes, is there a workaround?
Since the Amazon S3 event is configured to trigger on the prefix a/b/, it will also trigger the AWS Lambda function when an object is placed into a/b/archive/.
I recommend adding a line of code at the top of the Lambda function that checks the Key, which is passed to the function via the event parameter. It should check if the Key starts with a/b/archive/ (or similar rule) -- if so, it should exit the function immediately. This will not incur a significant cost because it will exit quickly and Lambda is only charged per millisecond.
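A minimal sketch of that early-exit check (process_excel here is a hypothetical stand-in for the actual processing and archiving code):

```python
import urllib.parse

def handler(event, context):
    for record in event["Records"]:
        # Keys in S3 events are URL-encoded (e.g. spaces become '+')
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        if key.startswith("a/b/archive/"):
            # Object was written by this function itself; ignore it
            print(f"Skipping archived object: {key}")
            continue

        process_excel(key)  # hypothetical processing / archiving step

def process_excel(key):
    ...
```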
The alternative is to put your archive folder in a different location.

Automated Real Time Data Processing on AWS with Lambda

I am interested in doing automated real-time data processing on AWS using Lambda and I am not certain about how I can trigger my Lambda function. My data processing code involves taking multiple files and concatenating them into a single data frame after performing calculations on each file. Since files are uploaded simultaneously onto S3 and files are dependent on each other, I would like the Lambda to be only triggered when all files are uploaded.
Current Approaches/Attempts:
- I am considering an S3 trigger, but my concern is that an S3 trigger will result in an error in the case where a single file upload starts the Lambda. An alternate option would be adding a wait time, but that is not preferred, to limit the computation resources used.
- A scheduled trigger using CloudWatch/EventBridge, but this would not be real-time processing.
- An SNS trigger, but I am not certain the message can be automated without knowing when the file uploads are complete.
Any suggestion is appreciated! Thank you!
If you really cannot do it with a scheduled function, the best option is to trigger a Lambda function when an object is created.
The tricky bit is that it will fire your function on each object upload. So you can either identify the "last part", e.g. based on some metadata, or you will need to store and track the state of all uploads, e.g. in DynamoDB, and do the actual processing only when a batch is complete.
Best, Stefan
Your file coming in parts might be named as:
filename_part1.ext
filename_part2.ext
If one of your systems is generating those files, have that system also generate a final, blank dummy file named:
filename.final
Since an S3 event trigger can filter on a suffix, use the .final extension to invoke the Lambda and process the records.
Alternatively, if you do not have access to the system putting objects into your S3 bucket, then on each PUT operation invoke the Lambda and insert an entry into DynamoDB.
You need to put a unique entry per file (not per file part) in DynamoDB with:
filename and last_part_received_time
The last_part_received_time keeps getting updated as long as file parts keep arriving.
This table can then be looked up by a cron-invoked Lambda which checks whether the time skew (the difference between the SYSTIME of the Lambda invocation and the entry's last_part_received_time) is large enough to safely process the records.
I would still prefer the first approach, as the second one still has a chance for error.
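A rough sketch of that second approach, assuming a DynamoDB table named upload_tracker (a placeholder) with filename as its partition key and a 2-minute skew threshold; process_batch is a hypothetical stand-in for the actual processing:

```python
import time
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("upload_tracker")  # placeholder table, partition key: filename

SKEW_SECONDS = 120  # assume a batch is complete if no new part arrives for 2 minutes


def on_s3_put(event, context):
    """Triggered on every PUT: upsert the base filename with the current time."""
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        base_name = key.rsplit("_part", 1)[0]  # filename_part1.ext -> filename
        table.update_item(
            Key={"filename": base_name},
            UpdateExpression="SET last_part_received_time = :now",
            ExpressionAttributeValues={":now": int(time.time())},
        )


def on_schedule(event, context):
    """Cron-invoked: process batches that have been quiet long enough."""
    now = int(time.time())
    # A real implementation would paginate the scan
    for item in table.scan()["Items"]:
        if now - int(item["last_part_received_time"]) >= SKEW_SECONDS:
            process_batch(item["filename"])  # hypothetical processing step
            table.delete_item(Key={"filename": item["filename"]})


def process_batch(filename):
    ...
```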
Since you want this to be as real time as possible, perhaps you could just perform your logic every single time a file is uploaded, updating the version of the output as new files are added, and iterating through an S3 prefix per grouping of files, like in this other SO answer.
In terms of the architecture, you could add in an SQS queue or two to make this more resilient. An S3 Put Event can trigger an SQS message, which can trigger a Lambda function, and you can have error handling logic in the Lambda function that puts that event in a secondary queue with a visibility timeout (sort of like a backoff strategy) or back in the same queue for retries.
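As a sketch of that retry pattern (the retry queue URL and process function are placeholders), the Lambda consuming the queue could push a failed event to a secondary queue with a delay as a simple backoff:

```python
import json
import boto3

sqs = boto3.client("sqs")
RETRY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/retry-queue"  # placeholder


def handler(event, context):
    for message in event["Records"]:  # records delivered by the SQS event source mapping
        s3_event = json.loads(message["body"])
        try:
            process(s3_event)  # hypothetical processing step
        except Exception:
            # Re-queue on a secondary queue with a delay instead of failing the batch
            sqs.send_message(
                QueueUrl=RETRY_QUEUE_URL,
                MessageBody=message["body"],
                DelaySeconds=300,
            )


def process(s3_event):
    ...
```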

Automating folder creation in S3

I have an S3 bucket into which clients drop data files (CSV files) each month. I was wondering if there is a way to automatically create a new "folder" (object) every time the files are dropped each month and put the newest files into that "folder". I need the CSV files separated by month so that AWS Glue can create new partitions when I run incremental crawlers on this bucket.
For example, let's say I have a S3 bucket called "client." On December 1st, a new CSV file ("DecClientData") will be dropped into that "client" bucket. I want to know if there is a way to automate the following two processes:
Create a "folder" (let's call it "dec") within "client".
Place the "DecClientData" file in the "dec" "folder".
Thanks in advance for any assistance you can provide!
S3 doesn't have the notion of folders commonly found in file systems; instead it has a flat structure (more details can be found here).
Instead, the full path of an object is stored in its Key (filename). For example, an object can be stored in Amazon S3 with a Key of files/2020-12/data.txt regardless of the existence of files and 2020-12 directories (they are not really directories but, at most, zero-length placeholder objects).
In your case, to solve both points you are mentioning, you should leverage S3 event notifications and use them as a Lambda trigger. When the Lambda function is triggered, it is passed the name of the object (Key) as part of the event; at that point you can simply change its Key.
For example: an object is uploaded to s3://my_bucket/uploads/file.txt; this creates an event notification that triggers a Lambda function. The function copies the object to s3://my_bucket/files/dec/file.txt and deletes the original one.
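A minimal sketch of such a function, assuming the month folder is derived from the upload date and reusing the uploads/ and files/ prefixes from the example above:

```python
import urllib.parse
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Skip objects this function has already moved, so it does not loop
        if not key.startswith("uploads/"):
            continue

        month = datetime.now(timezone.utc).strftime("%b").lower()  # e.g. "dec"
        filename = key.split("/")[-1]
        new_key = f"files/{month}/{filename}"

        # S3 has no rename: copy to the new Key, then delete the original
        s3.copy_object(Bucket=bucket, Key=new_key,
                       CopySource={"Bucket": bucket, "Key": key})
        s3.delete_object(Bucket=bucket, Key=key)
```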
Write an AWS Lambda function to create a folder in the client bucket and move the most recent .csv file (or files) into the new folder.
Then, configure the client S3 bucket to trigger the AWS Lambda function on new uploads through the event notification settings.

Is there a way to pass file size as an input parameter from an AWS S3 bucket to a StepFunctions state machine?

I'm currently struggling to configure automated input to my AWS StepFunctions state machine. Basically, I am trying to set up a state machine that is notified whenever an object create event takes place in a certain S3 bucket. When that happens, the input is passed to a choice state which checks the file size. If the file is small enough, it invokes a Lambda function to process the file contents. If the file is too large, it invokes a Lambda to split up the file into files of manageable size, and then invokes the other Lambda to process the contents of those files. The problem with this is that I cannot figure out a way to pass the file size in as input to the state machine.
I am generally aware of how input is passed to StepFunctions, and I know that S3 Lambda triggers contain file size as a parameter, but I still haven't been able to figure out a practical way of passing file size as an input parameter to a StepFunctions state machine.
I would greatly appreciate any help on this issue and am happy to clarify or answer any questions that you have to the best of my ability. Thank you!
Currently S3 events can't trigger Step Functions directly, so one option would be to create an S3 event that triggers a Lambda. The Lambda works as a proxy: it passes the file info to the state machine and kicks off the execution, and you can select just the data you want and pass only that to Step Functions.
The other option is to configure a state machine as a target for a CloudWatch Events rule. This will start an execution when files are added to an Amazon S3 bucket.
The first option is more flexible.
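A sketch of that proxy Lambda (the state machine ARN is a placeholder); the object size is already present in the S3 event record, so it can be passed straight into the execution input for the Choice state to evaluate:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = "arn:aws:states:us-east-1:123456789012:stateMachine:process-file"  # placeholder


def handler(event, context):
    for record in event["Records"]:
        s3_info = record["s3"]
        sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            input=json.dumps({
                "bucket": s3_info["bucket"]["name"],
                "key": s3_info["object"]["key"],
                "size": s3_info["object"]["size"],  # bytes, checked by the Choice state
            }),
        )
```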

Configure s3 event for alternate PUT operation

I have a Lambda function that gets triggered whenever an object is created in s3 bucket.
Now, I need to trigger the Lambda only for every alternate object creation.
Lambda should not be triggered when object is created for the first, third , fifth and so on time. But, Lambda should be triggered for the second, fourth, sixth and so on time.
For this, I created an s3 event for 'PUT' operation.
The first time, I used the PUT API. The second time, I uploaded the file using:
s3_res.meta.client.upload_file
I thought that it would not trigger the Lambda, since this was an upload and not a PUT. But this also triggered the Lambda.
Is there any way for this?
The reason that meta.client.upload_file is triggering your PUT event Lambda is that it is actually using PUT.
upload_file (docs) uses the TransferManager client, which uses PUT under the hood (you can see this in the code: https://github.com/boto/s3transfer/blob/develop/s3transfer/upload.py).
Looking at the AWS SDK, you'll see that POSTing to S3 is pretty much limited to the case where you want to give a browser/client a pre-signed URL to upload a file to (https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectPOST.html).
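For reference, generating such a pre-signed POST with boto3 looks roughly like this (the bucket and key are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Lets a browser/client POST this one object for the next 10 minutes
presigned = s3.generate_presigned_post(
    Bucket="my-bucket",
    Key="uploads/report.csv",
    ExpiresIn=600,
)
print(presigned["url"])     # where the form should be POSTed
print(presigned["fields"])  # form fields that must accompany the file
```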
If you want to count the number of times PUT has been called in order to take action on every even call, the easiest thing to do is to use something like DynamoDB to create a table of 'file-name' vs 'put-count' which you update with every PUT, and act accordingly.
Alternatively, you could enable bucket versioning. Then you could use list_object_versions to see how many times the file has been updated, although you should be aware that S3 is eventually consistent, so this may not be accurate if the file is being rapidly updated.
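A sketch of the DynamoDB counter approach, assuming a table named put_counts (a placeholder) keyed on the object Key; process is a hypothetical stand-in for the actual work:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("put_counts")  # placeholder table, partition key: filename


def handler(event, context):
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]

        # Atomically increment the per-file counter and read back the new value
        resp = table.update_item(
            Key={"filename": key},
            UpdateExpression="ADD put_count :one",
            ExpressionAttributeValues={":one": 1},
            ReturnValues="UPDATED_NEW",
        )
        count = int(resp["Attributes"]["put_count"])

        if count % 2 == 0:
            process(key)  # only act on the 2nd, 4th, 6th... upload


def process(key):
    ...
```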