How to Specify Additional Parameter to AWS Lambda Function Triggered by S3 - amazon-web-services

I am creating an AWS Lambda function that is triggered for each PUT on an S3 bucket. A separate Java application creates the S3 bucket, sets up the trigger to the Lambda on Put, and PUTs a set of files into the bucket. The Lambda function executes a compiled binary, it passes to the binary a script, which acts on the new S3 object.
All of this is working fine.
My problem is that I have a set of close to 100 different scripts, and am regularly developing new scripts. The ZIP for the Lambda contains all the scripts. Scripts correspond to different types of files, so when I run the Java application, I want to specify WHICH script in the Lambda function to use. I'm trying to avoid having to create a new Lambda for each script, since each one effectively does the exact same thing but for the name of the script.
When you INVOKE a Lambda, you can put parameters into the context. But my Lambda is triggered, so most of what I react to is in the event. I can't figure out how to communicate this simple parameter to the Lambda efficiently as I set up the S3 bucket and the event trigger.
How can I do this?

You can't have S3 post extra parameters to your Lambda function. What you can do is create a DynamoDB table that maps S3 buckets to scripts, or S3 prefixes to scripts, or something of the sort. Then your Lambda function can lookup that mapping before executing your script.

It is not possible to specify parameters that are passed to the AWS Lambda function. The function is triggered by Amazon S3, which passes standard information (bucket, key).
However, when creating the object in Amazon S3 you could attach object metadata. The Lambda function could then retrieve the metadata after it has been notified of the event.
An alternate approach would be to subscribe several Lambda functions to the S3 bucket. The functions could look at the event and decide whether or not to process the event.
For example, if you had pictures and text files being stored, you could create one Lambda function for pictures and another for text files. Both functions would be triggered upon object creation. Each function would look at the file extension (or, if necessary, look within the object itself). If it is a filetype that is handles, then it can process the object. If it is not a filetype it handles, the function can simply exit. This type of check could be performed very quickly and Lambda only charges per 100ms, so the cost would be close to irrelevant.
The benefit of this approach is that you could keep your libraries separate from each other, rather than making one large Lambda package.

Related

How to ensure that S3 upload triggers a lambda function, but copying data within the same bucket does not trigger the lambda function anymore?

Required procedure:
Someone does an upload to an S3 bucket.
This triggers a Lambda function that does some processing on the uploaded file(s).
Processed objects are now copied into a "processed" folder within the same bucket.
The copy-operation in Step 3 should never re-trigger the initial Lambda function itself.
I know that the general guidance is to use a different bucket for storing the processed objects in a situation like this (but this is not possible in this case).
So my approach was to set up the S3 trigger to only listen to PUT/POST-Method and excluded the COPY-Method. The lambda function itself uses python-boto (S3_CLIENT.copy_object(..)). The approach seems to work (the lambda function seems to not be retriggered by the copy operation)
However I wanted to ask if this approach is really reliable - is it?
You can filter which events trigger the S3 notification.
There are 2 ways to trigger lambda from S3 event in general: bucket notifications and EventBridge.
Notifications: https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-filtering.html
EB: https://aws.amazon.com/blogs/aws/new-use-amazon-s3-event-notifications-with-amazon-eventbridge/
In your case, a quick search doesn't show me that you can setup a "negative" rule, so "everything which doesn't have processed prefix". But you can rework your bucket structure a bit and dump unprocessed items into unprocessed and setup filter based on that prefix only.
When setting up an S3 trigger for lambda function, there is the possibility, to define which kind of overarching S3-event should be listened to:

How to make sure S3 doesn't put parallel event to lambda?

My architecture allows files to be put in s3 for which Lambda function runs concurrently. However, the files being put in S3 are somehow overwriting because of some other process in a gap of milliseconds. Those multiple put events for the same file are causing the lambda to trigger multiple times for the same event.
Is there a threshold I can set on s3 events (something that doesn't trigger the lambda multiple times for the same file event.)
Or what kind of s3 event only occurs when a file is created and not updated?
There is already a code in place which checks if the trigger file is present. if not, it creates the trigger file. But that is also of no use since the other process is very fast to put files is s3.
Something like this below -
try:
s3_client.head_object(Bucket=trigger_bucket, Key=trigger_file)
except ClientError as _:
create_trigger_file(
s3_client, trigger_bucket, trigger_file
)
You could configure Amazon S3 to send events to an Amazon SQS FIFO (first-in-first-out) queue. The queue could then trigger the Lambda function.
The benefit of using a FIFO queue is that each message has a Message Group ID. A FIFO queue will only provide one message to the AWS Lambda function per Message Group ID. It will not send another message with the same Message Group ID until the earlier one has been fully processed. If you set the Message Group Id to be the Key of the S3 object, then it would effectively have a separate queue for each object created in S3.
This method would allow Lambda functions to run in parallel for different objects, but for each particular Key there would only be a maximum of one Lambda function executing.
It appears your problem is that multiple invocations of the AWS Lambda function are attempting to access the same files at the same time.
To avoid this, you could modify the settings on the Lambda function to Manage Lambda reserved concurrency - AWS Lambda by setting the reserved concurrency to 1. This will only allow a single invocation of the Lambda function to run at any time.
I guess the problem is that your architecture needs to write to the same file. This is not scalable. From the documentation:
Amazon S3 does not support object locking for concurrent writers. If two PUT requests are simultaneously made to the same key, the request with the latest timestamp wins. If this is an issue, you must build an object-locking mechanism into your application.
So, think about your architecture. Why do you have a process that wants to process multiple times to the same file at the same time? The lambda's that create these S3 files, do they need to write to the same file? If I understand your use case correctly, every lambda could create an unique file. For example, based on the name of the PDF you want to create or with some timestamp added to it. That ensures you don't have write collisions. You could create lifecycle rules on the S3 bucket to delete the files after a day or so, such that you don't increase your storage costs too much. Or have a lambda delete the file when it is finished with it.

AWS lambda function and Athena to create partitioned table

Here's my requirements. Every day i'm receiving a CSV file into an S3 bucket. I need to partition that data and store it into Parquet to eventually map a Table. I was thinking about using AWS lambda function that is triggered whenever a file is uploaded. I'm not sure what are the steps to do that.
There are (as usual in AWS!) several ways to do this, the 2 first ones that come to me first are:
using a Cloudwatch Event, with an S3 PutObject Object level) action as trigger, and a lambda function that you have already created as a target.
starting from the Lambda function it is slightly easier to add suffix-filtered triggers, eg for any .csv file, by going to the function configuration in the Console, and in the Designer section adding a trigger, then choose S3 and the actions you want to use, eg bucket, event type, prefix, suffix.
In both cases, you will need to write the lambda function in either case to do the work you have described, and it will need IAM access to the bucket to pull the files and process them.

AWS Lambda S3 event infinite loop

I want to use S3 events to publish to AWS Lambda whenever a video file (.mp4) gets uploaded, so that it can be compressed. The problem is that the path to the video file is stored in RDS, so I want the path to remain the same after compression. From what I've read, replacing the file will again call the Object Created event leading to an infinite loop.
Is there any way to replace the file without triggering any event? What are my options?
You are correct that you cannot completely distinguish. From the documentation the following events are supported:
s3:ObjectCreated:Put – An object was created by an HTTP PUT operation.
s3:ObjectCreated:Post – An object was created by HTTP POST operation.
s3:ObjectCreated:Copy – An object was created an S3 copy operation.
s3:ObjectCreated:CompleteMultipartUpload – An object was created by the completion of a S3 multi-part upload.
s3:ObjectCreated:* – An object was created by one of the event types listed above or by a similar object creation event added in the future.
s3:ReducedRedundancyObjectLost – An S3 object stored with Reduced Redundancy has been lost.
The architecture that I would generally see for this type of problem is having 2 S3 buckets
1 S3 Bucket stores the source material without any modification, this would trigger the Lambda event.
1 S3 Bucket stores the processed artifact, from the compressed output.
By doing this you can store the original, and rerun if needed to autocorrect.
There is an ungraceful solution to this problem, which is not documented anywhere.
The event parameter in the Lambda function contains a userIdentity dict which contains principalId. For an event which originated because of AWS Lambda (like updating S3 object as mentioned in question), this principalId contains the name of the lambda function appended at the end.
Therefore, by checking the principalId one can deduce whether the event came from Lambda or not, and accordingly compress or not.

How to invoke athena triggered automatically by lambda when objects are updated in the s3 bucket?

I have following 2 use case to apply on this
Case 1. I would need to call the lambda alone to invoke athena to perform query on s3 data? Question: How to invoke lambda alone via api?
Case 2. I would need lambda function to invoke athena whenever a file copied to the same s3 bucket that already mapped to the athena?
Iam referring following link to do the same to perform the Lambda operation over athena
Link:
https://dev.classmethod.jp/cloud/run-amazon-athenas-query-with-aws-lambda/
For the case 2: Following are eg want to integrate:
File in s3-1 is sales.csv - and i would updating sales details by copying data from other s3-2 . And the schema/column defined in the s3-1 data would remain same.
so when i copy some file to the same s3 data that mapped to the athena, the lambda should call athena to perform the query
Appreciate if can provide the better way to achieve above cases?
Thanks
Case 1
An AWS Lambda can be directly invoked via the invoke() command. This can be done via the AWS Command-Line Interface (CLI) or from a programming language using an AWS SDK.
Case 2
An Amazon S3 event can be configured on a bucket to automatically trigger an AWS Lambda function when a file is uploaded. The event provides the bucket name and file name (object name) to the Lambda function.
The Lambda function can extract these details from the event record and can then use that information in an Amazon Athena command.
Please note that, if the file name is different each time, a CREATE TABLE command would be required before a SELECT command can query the data.
General Comments
A Lambda function can run for a maximum of 15 minutes, so make sure the Athena queries do not take more than this time. This is not a particularly efficient use of an AWS Lambda function because it will be billed for the duration of the function call, even if it is just waiting for Athena to finish.
Another option would be to have the Lambda function directly process the file, assuming that the query is not particularly complex. For example, the Lambda function could download the file to temporary storage (maximum 500MB), read through the file, do some calculations (eg add up the total of some columns), then store the results somewhere.
The next step wuold be create a end point to your lambda, you ver can use aws-apigateway for that.
On the other hand, using the amazon console or amazon cli, you can invoke the lambda in order to test.