I want to use S3 events to publish to AWS Lambda whenever a video file (.mp4) gets uploaded, so that it can be compressed. The problem is that the path to the video file is stored in RDS, so I want the path to remain the same after compression. From what I've read, replacing the file will fire the ObjectCreated event again, leading to an infinite loop.
Is there any way to replace the file without triggering any event? What are my options?
You are correct: the event types alone do not let you completely distinguish your own replacement write from an ordinary upload. From the documentation, the following events are supported:
s3:ObjectCreated:Put – An object was created by an HTTP PUT operation.
s3:ObjectCreated:Post – An object was created by an HTTP POST operation.
s3:ObjectCreated:Copy – An object was created by an S3 copy operation.
s3:ObjectCreated:CompleteMultipartUpload – An object was created by the completion of an S3 multipart upload.
s3:ObjectCreated:* – An object was created by one of the event types listed above or by a similar object creation event added in the future.
s3:ReducedRedundancyObjectLost – An S3 object stored with Reduced Redundancy has been lost.
The architecture I would generally expect for this type of problem uses two S3 buckets:
One bucket stores the source material without any modification; this is the bucket that triggers the Lambda function.
The other bucket stores the processed artifact, i.e. the compressed output.
By doing this you keep the original and can re-run the processing whenever needed to correct a bad run.
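A minimal sketch of that layout, assuming a hypothetical processed-videos output bucket and a placeholder compress_video step (the real compression would typically shell out to something like ffmpeg):

import boto3

s3 = boto3.client("s3")
OUTPUT_BUCKET = "processed-videos"  # hypothetical second bucket, no trigger attached

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Pull the original from the source bucket, compress it, and write the
        # result to the output bucket under the same key, so the path stored
        # in RDS only needs to reference the processed bucket.
        local_path = "/tmp/" + key.split("/")[-1]
        s3.download_file(bucket, key, local_path)
        compressed = compress_video(local_path)  # placeholder for the actual compression
        s3.upload_file(compressed, OUTPUT_BUCKET, key)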
There is an ungraceful solution to this problem, which is not documented anywhere.
The event parameter in the Lambda function contains a userIdentity dict, which in turn contains a principalId. For an event that originated from AWS Lambda (such as the S3 object update mentioned in the question), this principalId has the name of the Lambda function appended at the end.
Therefore, by checking the principalId you can deduce whether the event came from your Lambda or not, and only compress when it did not.
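A rough sketch of that check, relying on the undocumented principalId behaviour described above (the function name is a placeholder):

FUNCTION_NAME = "video-compressor"  # hypothetical name of this Lambda function

def handler(event, context):
    for record in event["Records"]:
        principal_id = record.get("userIdentity", {}).get("principalId", "")

        # If this function wrote the object itself, the principalId ends with
        # the function name, so skip the record to break the loop.
        if principal_id.endswith(FUNCTION_NAME):
            continue

        compress_and_replace(record)  # placeholder for the actual compression step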
Required procedure:
1. Someone does an upload to an S3 bucket.
2. This triggers a Lambda function that does some processing on the uploaded file(s).
3. The processed objects are then copied into a "processed" folder within the same bucket.
The copy operation in step 3 should never re-trigger the initial Lambda function itself.
I know that the general guidance is to use a different bucket for storing the processed objects in a situation like this (but this is not possible in this case).
So my approach was to set up the S3 trigger to listen only to the PUT and POST methods and to exclude the COPY method. The Lambda function itself uses python-boto (S3_CLIENT.copy_object(..)). The approach seems to work: the Lambda function does not appear to be re-triggered by the copy operation.
However, I wanted to ask whether this approach is really reliable. Is it?
You can filter which events trigger the S3 notification.
In general there are two ways to trigger Lambda from an S3 event: bucket notifications and EventBridge.
Notifications: https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-filtering.html
EventBridge: https://aws.amazon.com/blogs/aws/new-use-amazon-s3-event-notifications-with-amazon-eventbridge/
In your case, a quick search doesn't show me a way to set up a "negative" rule such as "everything that does not have the processed prefix". But you can rework your bucket structure a bit, dump unprocessed items under an unprocessed/ prefix, and set up the filter on that prefix only.
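For example, a notification configuration limited to an unprocessed/ prefix could look roughly like this (bucket name and function ARN are placeholders):

import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="my-bucket",  # placeholder
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:eu-west-1:123456789012:function:process-upload",  # placeholder
                "Events": ["s3:ObjectCreated:Put", "s3:ObjectCreated:Post"],
                # Only keys under unprocessed/ fire the function, so copies
                # into processed/ never re-trigger it.
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "unprocessed/"}]}
                },
            }
        ]
    },
)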
When setting up an S3 trigger for a Lambda function, it is also possible to define which specific kinds of S3 events should be listened to, for example PUT and POST but not COPY.
My architecture allows files to be put in S3, for which a Lambda function runs concurrently. However, the files being put in S3 are somehow being overwritten by some other process within a gap of milliseconds. Those multiple PUT events for the same file cause the Lambda to trigger multiple times for the same object.
Is there a threshold I can set on S3 events, something that keeps the Lambda from triggering multiple times for the same file?
Or is there a kind of S3 event that only occurs when a file is created and not when it is updated?
There is already code in place that checks whether the trigger file is present and, if not, creates it. But that is of no use either, since the other process is very fast at putting files in S3.
Something like this:
import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client("s3")

try:
    # Check whether the trigger file already exists.
    s3_client.head_object(Bucket=trigger_bucket, Key=trigger_file)
except ClientError:
    # Not found (or not accessible), so create it.
    create_trigger_file(s3_client, trigger_bucket, trigger_file)
You could configure Amazon S3 to send events to an Amazon SQS FIFO (first-in-first-out) queue. The queue could then trigger the Lambda function.
The benefit of using a FIFO queue is that each message has a Message Group ID. A FIFO queue will only provide one message per Message Group ID to the AWS Lambda function at a time; it will not send another message with the same Message Group ID until the earlier one has been fully processed. If you set the Message Group ID to the Key of the S3 object, you effectively get a separate queue for each object created in S3.
This method would allow Lambda functions to run in parallel for different objects, but for each particular Key there would only be a maximum of one Lambda function executing.
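A minimal sketch of publishing such a message with the Message Group ID taken from the object key (the queue URL is a placeholder, and the exact wiring between S3 and the FIFO queue is left out here):

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/uploads.fifo"  # placeholder

def forward(record):
    key = record["s3"]["object"]["key"]
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps(record),
        # One group per object key: the queue will not hand Lambda a second
        # message for the same key until the first has been processed.
        MessageGroupId=key,
        MessageDeduplicationId=record["s3"]["object"].get("sequencer", key),
    )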
It appears your problem is that multiple invocations of the AWS Lambda function are attempting to access the same files at the same time.
To avoid this, you could modify the Lambda function's settings (see Manage Lambda reserved concurrency in the AWS Lambda documentation) by setting the reserved concurrency to 1. This will only allow a single invocation of the Lambda function to run at any time.
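For example, the cap can be applied once with boto3 (the function name is a placeholder):

import boto3

lambda_client = boto3.client("lambda")

# Allow at most one concurrent execution; additional asynchronous invocations
# are throttled and retried by Lambda rather than run in parallel.
lambda_client.put_function_concurrency(
    FunctionName="my-processing-function",  # placeholder
    ReservedConcurrentExecutions=1,
)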
I guess the problem is that your architecture needs to write to the same file. This is not scalable. From the documentation:
Amazon S3 does not support object locking for concurrent writers. If two PUT requests are simultaneously made to the same key, the request with the latest timestamp wins. If this is an issue, you must build an object-locking mechanism into your application.
So, think about your architecture. Why do you have a process that wants to write to the same file multiple times at the same time? The Lambdas that create these S3 files, do they need to write to the same file? If I understand your use case correctly, every Lambda could create a unique file, for example based on the name of the PDF you want to create or with some timestamp added to it. That ensures you don't have write collisions. You could create lifecycle rules on the S3 bucket to delete the files after a day or so, so that you don't increase your storage costs too much. Or have a Lambda delete the file when it is finished with it.
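A small sketch of writing each result under a unique key instead of a shared one (bucket name and prefix are placeholders):

import datetime
import uuid
import boto3

s3 = boto3.client("s3")

def write_result(body: bytes, base_name: str) -> str:
    # A timestamp plus a random suffix keeps concurrent writers from ever
    # colliding on the same key.
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    key = f"results/{base_name}-{stamp}-{uuid.uuid4().hex[:8]}.pdf"
    s3.put_object(Bucket="my-output-bucket", Key=key, Body=body)  # placeholder bucket
    return key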
I am setting up an S3 bucket. In this S3 bucket, data is going to be written by an external process.
I am setting up an AWS Lambda that would be triggered when an object in S3 gets created/updated and would process and store the data in RDS.
Here my question is as follows:
If objects get written to S3 too fast, there is a possibility of multiple Lambda functions being triggered simultaneously. So, in this case, is there any chance of the objects being processed out of the order in which they were written to the S3 bucket?
If the answer to that question is yes, then from Lambda I have to push the payload to a FIFO SQS queue and set up a listener to process the payload and finally store the data in RDS.
Sadly, they are not guaranteed to arrive in order. From the docs:
Event notifications are not guaranteed to arrive in the order that the events occurred. However, notifications from events that create objects (PUTs) and delete objects contain a sequencer, which can be used to determine the order of events for a given object key.
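A rough sketch of using that sequencer to ignore stale events for a key, assuming the last seen value is persisted somewhere like DynamoDB; padding the hex strings before comparing is an assumption of this sketch, not something stated in the quote above:

def is_newer(candidate: str, last_seen: str) -> bool:
    # Sequencer values are hex strings that can differ in length; right-pad
    # with zeros to the same length before a lexicographic comparison.
    width = max(len(candidate), len(last_seen))
    return candidate.ljust(width, "0") > last_seen.ljust(width, "0")

def handler(event, context):
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        sequencer = record["s3"]["object"]["sequencer"]

        last_seen = load_last_sequencer(key)        # hypothetical lookup, e.g. DynamoDB
        if last_seen and not is_newer(sequencer, last_seen):
            continue                                # older or duplicate event for this key
        process(record)                             # hypothetical processing step
        save_last_sequencer(key, sequencer)         # hypothetical persist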
Here are my requirements. Every day I receive a CSV file into an S3 bucket. I need to partition that data and store it as Parquet so I can eventually map a table on top of it. I was thinking about using an AWS Lambda function that is triggered whenever a file is uploaded, but I'm not sure what the steps are to do that.
There are (as usual in AWS!) several ways to do this; the first two that come to mind are:
using a CloudWatch Events rule, with an S3 PutObject (object-level) action as the trigger and a Lambda function that you have already created as the target.
starting from the Lambda function, where it is slightly easier to add suffix-filtered triggers, e.g. for any .csv file: go to the function configuration in the console, add a trigger in the Designer section, then choose S3 and the options you want, e.g. bucket, event type, prefix, suffix.
In both cases you will need to write the Lambda function to do the work you have described, and it will need IAM access to the bucket to pull the files and process them.
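As a sketch of what the function body could look like, here is one way to do the CSV-to-partitioned-Parquet step with the AWS SDK for pandas (awswrangler), which would need to be packaged as a layer or dependency; the output path and partition column are placeholders:

import awswrangler as wr

OUTPUT_PATH = "s3://my-curated-bucket/my_table/"  # placeholder

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Read the uploaded CSV and rewrite it as partitioned Parquet.
        df = wr.s3.read_csv(f"s3://{bucket}/{key}")
        wr.s3.to_parquet(
            df=df,
            path=OUTPUT_PATH,
            dataset=True,
            mode="append",
            partition_cols=["ingest_date"],  # placeholder: must be a column in the CSV
        )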
I am creating an AWS Lambda function that is triggered for each PUT on an S3 bucket. A separate Java application creates the S3 bucket, sets up the trigger to the Lambda on PUT, and PUTs a set of files into the bucket. The Lambda function executes a compiled binary and passes it a script, which acts on the new S3 object.
All of this is working fine.
My problem is that I have a set of close to 100 different scripts, and am regularly developing new scripts. The ZIP for the Lambda contains all the scripts. Scripts correspond to different types of files, so when I run the Java application, I want to specify WHICH script in the Lambda function to use. I'm trying to avoid having to create a new Lambda for each script, since each one effectively does the exact same thing but for the name of the script.
When you INVOKE a Lambda, you can put parameters into the context. But my Lambda is triggered, so most of what I react to is in the event. I can't figure out how to communicate this simple parameter to the Lambda efficiently as I set up the S3 bucket and the event trigger.
How can I do this?
You can't have S3 post extra parameters to your Lambda function. What you can do is create a DynamoDB table that maps S3 buckets to scripts, or S3 prefixes to scripts, or something of the sort. Then your Lambda function can look up that mapping before executing your script.
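A sketch of that lookup, assuming a hypothetical table keyed by bucket name with a script attribute:

import boto3

table = boto3.resource("dynamodb").Table("bucket-script-mapping")  # hypothetical table

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]

        # Resolve which script this bucket (or prefix) is mapped to.
        item = table.get_item(Key={"bucket": bucket}).get("Item")
        if item:
            run_script(item["script"], record)  # placeholder for invoking the binary + script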
It is not possible to specify parameters that are passed to the AWS Lambda function. The function is triggered by Amazon S3, which passes standard information (bucket, key).
However, when creating the object in Amazon S3 you could attach object metadata. The Lambda function could then retrieve the metadata after it has been notified of the event.
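For example, the uploader could attach the script name as user metadata and the function could read it back with a HEAD request (a sketch with placeholder names; the uploader side is shown in Python even though the question uses Java):

import boto3

s3 = boto3.client("s3")

# Uploader side: tag the object with the script that should process it.
s3.put_object(
    Bucket="my-bucket",            # placeholder
    Key="incoming/data.bin",       # placeholder
    Body=b"...",
    Metadata={"script": "parse_type_a"},  # placeholder script name
)

# Lambda side: fetch the metadata after being notified of the event.
def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        head = s3.head_object(Bucket=bucket, Key=key)
        script = head["Metadata"].get("script")  # user metadata keys come back lowercased
        run_script(script, bucket, key)          # placeholder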
An alternate approach would be to subscribe several Lambda functions to the S3 bucket. The functions could look at the event and decide whether or not to process the event.
For example, if you had pictures and text files being stored, you could create one Lambda function for pictures and another for text files. Both functions would be triggered upon object creation. Each function would look at the file extension (or, if necessary, look within the object itself). If it is a filetype that it handles, it can process the object. If it is not a filetype it handles, the function can simply exit. This type of check can be performed very quickly, and Lambda only charges per 100ms, so the cost would be close to irrelevant.
The benefit of this approach is that you could keep your libraries separate from each other, rather than making one large Lambda package.
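A sketch of that early-exit check in one of the functions (the handled extensions are placeholders):

HANDLED_EXTENSIONS = (".jpg", ".jpeg", ".png")  # placeholder: this function handles pictures only

def handler(event, context):
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]

        # Not a filetype this function handles: exit quickly and cheaply.
        if not key.lower().endswith(HANDLED_EXTENSIONS):
            continue

        process_picture(record)  # placeholder for the real work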