Processing S3 events in order with Lambda

I am setting up an S3 bucket into which data will be written by an external process.
I am also setting up an AWS Lambda function that is triggered when an object in S3 gets created or updated, processes it, and stores the data in RDS.
My question is the following:
If objects are written to S3 too quickly, multiple Lambda invocations may be triggered simultaneously.
In that case, is there any chance the objects are processed in a different order from the one in which they were written to the S3 bucket?
If the answer to the above question is yes, then, from the Lambda, I would have to push the payload to a FIFO SQS queue and set up a listener that processes the payload and finally stores the data in RDS.

Sadly, they are not guaranteed to arrive in order. From the docs:
Event notifications are not guaranteed to arrive in the order that the events occurred. However, notifications from events that create objects (PUTs) and delete objects contain a sequencer, which can be used to determine the order of events for a given object key.
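For a single object key you can order the events yourself by comparing their sequencer values. A minimal sketch, assuming the documented rule of right-padding the shorter hex string with zeros before a lexicographic comparison:

def is_newer(sequencer_a, sequencer_b):
    """Return True if the event carrying sequencer_a happened after the one
    carrying sequencer_b. Only valid for events on the same object key."""
    width = max(len(sequencer_a), len(sequencer_b))
    return sequencer_a.ljust(width, "0") > sequencer_b.ljust(width, "0")

# The sequencer lives at record["s3"]["object"]["sequencer"] in each event record.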

Related

How to ensure that S3 upload triggers a lambda function, but copying data within the same bucket does not trigger the lambda function anymore?

Required procedure:
1. Someone uploads a file to an S3 bucket.
2. This triggers a Lambda function that does some processing on the uploaded file(s).
3. The processed objects are then copied into a "processed" folder within the same bucket.
The copy operation in step 3 must never re-trigger the initial Lambda function.
I know that the general guidance is to use a different bucket for storing the processed objects in a situation like this (but that is not possible in this case).
So my approach was to set up the S3 trigger to listen only to the PUT and POST events and to exclude the COPY event. The Lambda function itself uses boto3 (S3_CLIENT.copy_object(..)). The approach seems to work: the Lambda function does not appear to be re-triggered by the copy operation.
However, I wanted to ask whether this approach is really reliable - is it?
You can filter which events trigger the S3 notification.
In general there are two ways to trigger Lambda from an S3 event: bucket notifications and EventBridge.
Notifications: https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-filtering.html
EB: https://aws.amazon.com/blogs/aws/new-use-amazon-s3-event-notifications-with-amazon-eventbridge/
In your case, a quick search doesn't show that you can set up a "negative" rule, i.e. "everything that doesn't have the processed prefix". But you can rework your bucket structure a bit, dump unprocessed items under an unprocessed/ prefix, and set up the filter based on that prefix only.
When setting up an S3 trigger for a Lambda function, you can also define which kinds of S3 events should be listened to, as in the sketch below.
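For example, a sketch of that configuration with boto3, assuming an unprocessed/ prefix and placeholder bucket and function names:

import boto3

s3 = boto3.client("s3")
# Note: this call replaces the bucket's existing notification configuration.
s3.put_bucket_notification_configuration(
    Bucket="my-bucket",  # placeholder
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-upload",  # placeholder
                "Events": [
                    "s3:ObjectCreated:Put",
                    "s3:ObjectCreated:Post",
                    "s3:ObjectCreated:CompleteMultipartUpload",
                ],
                # Only objects under unprocessed/ trigger the function,
                # so copies into processed/ never re-trigger it.
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "unprocessed/"}]}
                },
            }
        ]
    },
)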

How to make sure S3 doesn't put parallel event to lambda?

My architecture allows files to be put in S3, for which a Lambda function runs concurrently. However, the files being put in S3 are sometimes overwritten by another process within a gap of milliseconds. Those multiple PUT events for the same file cause the Lambda to be triggered multiple times for the same object.
Is there a threshold I can set on S3 events (something that prevents the Lambda from being triggered multiple times for the same file)?
Or is there an S3 event that only occurs when a file is created and not when it is updated?
There is already code in place that checks whether the trigger file is present and creates it if not. But that is also of no use, since the other process is very fast to put files in S3.
Something like this below -
import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client("s3")

try:
    # Check whether the trigger file already exists
    s3_client.head_object(Bucket=trigger_bucket, Key=trigger_file)
except ClientError:
    # Not found (or not accessible), so create the trigger file
    create_trigger_file(s3_client, trigger_bucket, trigger_file)
You could configure Amazon S3 to send events to an Amazon SQS FIFO (first-in-first-out) queue. The queue could then trigger the Lambda function.
The benefit of using a FIFO queue is that each message has a Message Group ID. A FIFO queue will only provide one message to the AWS Lambda function per Message Group ID. It will not send another message with the same Message Group ID until the earlier one has been fully processed. If you set the Message Group ID to the key of the S3 object, you would effectively have a separate queue for each object created in S3.
This method would allow Lambda functions to run in parallel for different objects, but for each particular Key there would only be a maximum of one Lambda function executing.
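A minimal sketch of enqueuing such a message, assuming a FIFO queue already exists (the queue URL is a placeholder):

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events.fifo"  # placeholder

def enqueue(bucket, key, sequencer):
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"bucket": bucket, "key": key}),
        MessageGroupId=key,                # one in-flight message per object key
        MessageDeduplicationId=sequencer,  # deduplicate retries of the same event
    )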
It appears your problem is that multiple invocations of the AWS Lambda function are attempting to access the same files at the same time.
To avoid this, you could set the function's reserved concurrency to 1 (see Manage Lambda reserved concurrency in the AWS Lambda documentation). This allows only a single invocation of the Lambda function to run at any time.
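For reference, a sketch of setting that limit with boto3 (the function name is a placeholder); the same can be done from the console or the CLI:

import boto3

lambda_client = boto3.client("lambda")
# Cap the function at one concurrent execution
lambda_client.put_function_concurrency(
    FunctionName="my-processing-function",  # placeholder
    ReservedConcurrentExecutions=1,
)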
I guess the problem is that your architecture needs to write to the same file, which is not scalable. From the documentation:
Amazon S3 does not support object locking for concurrent writers. If two PUT requests are simultaneously made to the same key, the request with the latest timestamp wins. If this is an issue, you must build an object-locking mechanism into your application.
So think about your architecture. Why do you have a process that wants to write to the same file multiple times at the same moment? Do the Lambdas that create these S3 files really need to write to the same file? If I understand your use case correctly, every Lambda could create a unique file, for example based on the name of the PDF you want to create or with a timestamp added to it. That ensures you don't have write collisions. You could create lifecycle rules on the S3 bucket to delete the files after a day or so, so that storage costs don't grow too much, or have a Lambda delete each file once it is finished with it.
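A minimal sketch of generating such unique keys, assuming a timestamp-plus-random-suffix naming scheme is acceptable:

import uuid
import datetime
import boto3

s3 = boto3.client("s3")

def unique_key(prefix):
    # e.g. results/20240101T120000123456-3f2a....json
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%S%f")
    return f"{prefix}/{stamp}-{uuid.uuid4().hex}.json"

# s3.put_object(Bucket="my-bucket", Key=unique_key("results"), Body=b"...")  # placeholder bucket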

AWS Lambda S3 event infinite loop

I want to use S3 events to publish to AWS Lambda whenever a video file (.mp4) gets uploaded, so that it can be compressed. The problem is that the path to the video file is stored in RDS, so I want the path to remain the same after compression. From what I've read, replacing the file will again call the Object Created event leading to an infinite loop.
Is there any way to replace the file without triggering any event? What are my options?
You are correct that you cannot fully distinguish your own replacement upload from an original upload. From the documentation, the following events are supported:
s3:ObjectCreated:Put – An object was created by an HTTP PUT operation.
s3:ObjectCreated:Post – An object was created by HTTP POST operation.
s3:ObjectCreated:Copy – An object was created by an S3 copy operation.
s3:ObjectCreated:CompleteMultipartUpload – An object was created by the completion of an S3 multipart upload.
s3:ObjectCreated:* – An object was created by one of the event types listed above or by a similar object creation event added in the future.
s3:ReducedRedundancyObjectLost – An S3 object stored with Reduced Redundancy has been lost.
The architecture that I would generally recommend for this type of problem uses two S3 buckets:
One S3 bucket stores the source material without any modification; this bucket triggers the Lambda event.
One S3 bucket stores the processed artifact, i.e. the compressed output.
By doing this you keep the original and can re-run the processing if needed.
There is an ungraceful solution to this problem, which is not documented anywhere.
The event parameter passed to the Lambda function contains a userIdentity dict, which contains a principalId. For an event that originated from AWS Lambda (such as updating the S3 object, as mentioned in the question), this principalId has the name of the Lambda function appended at the end.
Therefore, by checking the principalId you can deduce whether the event came from your own Lambda or not, and compress or skip accordingly.
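A sketch of that check, assuming a hypothetical function name of "video-compressor" and relying on the undocumented principalId behaviour described above:

LAMBDA_NAME = "video-compressor"  # hypothetical function name

def handler(event, context):
    for record in event["Records"]:
        principal = record.get("userIdentity", {}).get("principalId", "")
        if principal.endswith(LAMBDA_NAME):
            # The event was caused by our own write: skip it to avoid the loop.
            continue
        # ... compress the object referenced by record["s3"] ...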

Using S3 instead of SQS for integration purposes

Hello folks. There's a question I've recently faced, which brought up some concerns and hesitations. I'm creating an "almost serverless" microservice using AWS (see the workflow options diagram).
The thing is that the input message may be large, but AWS SQS limits the message size to 256 KB. So I decided to use S3 and S3 notifications to handle inputs: the client PUTs an object, its creation triggers a Lambda function, and so on. That way the 256 KB limit is not relevant, but on the other hand I'm using a storage service as an integration service. One of the concerns is dead-letter queue handling, for example.
Maybe someone has faced similar problems. One of the requirements is to stay "serverless". Are there any good solutions, improvements, or advice?
Thanks in advance.
I would recommend combining the two approaches:
Write the data to Amazon S3
Create a message in the Amazon SQS queue that includes a reference to the data in S3
This way, you have the benefits of using a queue, with additional storage.
If all the data you require is already in the file, then you can configure an Amazon S3 Event to create the SQS message directly in the queue. The message will include the name of the bucket and the key of the object. Thus, putting the file in S3 will create the SQS message and trigger the AWS Lambda function. This is more scalable than directly triggering the Lambda function from S3.
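A sketch of the Lambda handler behind that queue, assuming the queue receives standard S3 event notifications whose bodies carry the bucket name and object key:

import json
import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for sqs_record in event["Records"]:
        notification = json.loads(sqs_record["body"])
        # S3 test events have no "Records" key, so skip them gracefully.
        for s3_record in notification.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            payload = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            # ... process the payload; failed messages end up in the queue's DLQ ...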
Have you considered using a Kinesis stream and attaching your Lambda to the stream with a batch size of 1? You could also handle your dead letters with shard timestamps, etc.
Or, if you are able to manipulate the original message, place your timestamp inside the message; then you can easily utilize Kinesis Firehose and bulk-load messages. The limit on a Kinesis record is 1 MB, roughly four times the SQS message size, and you could compress the payload to go further.

Multiple S3 events together trigger a lambda function

I have a scenario where I have two buckets s3-a and s3-b.
When data is put into s3-a it sends out an S3 event.
The same happens with s3-b.
I need to trigger a lambda function when I have the data in both the S3 buckets.
One way I can think of is to use a DynamoDB table as a marker: when a corresponding S3 object is found, set its marker, and then, through DynamoDB Streams, invoke a Lambda that checks whether both markers are true.
Check the data in both buckets in each trigger. Whichever trigger finds data in both buckets proceeds further. Make it idempotent, so that if both triggers find the data in both buckets there is no adverse effect. A minimal sketch of that check is below.
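A minimal sketch, assuming both objects are written under the same key and the bucket names are placeholders:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKETS = ("s3-a", "s3-b")  # placeholder bucket names

def both_present(key):
    for bucket in BUCKETS:
        try:
            s3.head_object(Bucket=bucket, Key=key)
        except ClientError:
            return False
    return True

def handler(event, context):
    key = event["Records"][0]["s3"]["object"]["key"]
    if both_present(key):
        # Proceed; keep this step idempotent, since both triggers may reach it.
        pass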