My Lambda function processes files from two S3 locations, and I don't know when each file will land. I want to trigger the Lambda only after I have received an s3:PutObject event from both locations.
Is it possible to define an event rule that monitors or waits for files to be put in both S3 locations and only then forwards the event to the target?
I know there are many other ways to solve this. This is not my exact problem statement, just a basic use case related to my requirement.
Please advise whether this is possible with AWS EventBridge and, if so, how I can configure it.
Thanks.
I went through the EventBridge documentation and couldn't find anything matching my requirement.
To answer your question directly: no, there is no EventBridge rule that does this. You need to implement an intermediary yourself that caches whichever object lands first and then emits a single EventBridge event when the second S3 object lands.
The simplest option is probably S3 => DynamoDB: use whatever key the two files have in common as the DynamoDB item ID, and store each S3 event in its own attribute on that item. Then route a DynamoDB stream => EventBridge, so that you get an event containing both files' metadata.
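A minimal sketch of that intermediary, assuming a DynamoDB table keyed on a hypothetical pair_id attribute and two made-up location names (location_a, location_b). The helpers only build the arguments for boto3's Table.update_item and inspect the attributes it returns, so the names and the table layout are illustrative rather than anything prescribed by AWS:

```python
from typing import Any, Dict, Optional

# Hypothetical names for the two S3 locations being paired.
REQUIRED_LOCATIONS = ("location_a", "location_b")


def build_update(pair_id: str, location: str, s3_key: str) -> Dict[str, Any]:
    """Build keyword arguments for a DynamoDB Table.update_item call.

    The item id is the key both uploads share; each location is stored
    in its own attribute so the two writers never overwrite each other.
    """
    if location not in REQUIRED_LOCATIONS:
        raise ValueError(f"unknown location: {location}")
    return {
        "Key": {"pair_id": pair_id},
        "UpdateExpression": "SET #loc = :key",
        "ExpressionAttributeNames": {"#loc": location},
        "ExpressionAttributeValues": {":key": s3_key},
        # ALL_NEW returns the whole item as it looks after this update.
        "ReturnValues": "ALL_NEW",
    }


def completed_pair(attributes: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Return both object keys once every location has reported, else None."""
    if all(loc in attributes for loc in REQUIRED_LOCATIONS):
        return {loc: attributes[loc] for loc in REQUIRED_LOCATIONS}
    return None
```

In a Lambda attached to each S3 location you would call table.update_item(**build_update(...)) and pass the returned "Attributes" to completed_pair; a non-None result means the second file has arrived. The same check works on the NewImage of a stream record if you prefer the DynamoDB stream route.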
Required procedure:
1. Someone uploads a file to an S3 bucket.
2. This triggers a Lambda function that does some processing on the uploaded file(s).
3. Processed objects are now copied into a "processed" folder within the same bucket.
4. The copy operation in Step 3 should never re-trigger the initial Lambda function itself.
I know that the general guidance is to use a different bucket for storing the processed objects in a situation like this (but this is not possible in this case).
So my approach was to set up the S3 trigger to listen only to the PUT and POST methods and to exclude COPY. The Lambda function itself uses python-boto (S3_CLIENT.copy_object(..)). The approach seems to work: the Lambda function does not appear to be retriggered by the copy operation.
However, I wanted to ask whether this approach is really reliable. Is it?
You can filter which events trigger the S3 notification.
In general, there are two ways to trigger a Lambda from an S3 event: bucket notifications and EventBridge.
Notifications: https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-filtering.html
EB: https://aws.amazon.com/blogs/aws/new-use-amazon-s3-event-notifications-with-amazon-eventbridge/
In your case, a quick search doesn't turn up a way to set up a "negative" rule such as "everything that doesn't have the processed prefix". But you can rework your bucket structure a bit: put unprocessed items under an unprocessed prefix and set up the filter on that prefix only.
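As a rough sketch of that setup, the helper below builds the NotificationConfiguration you would pass to boto3's put_bucket_notification_configuration. The unprocessed/ prefix and the function names are assumptions for illustration; the event names and the prefix filter rule are the standard S3 notification fields:

```python
from typing import Any, Dict

UNPROCESSED_PREFIX = "unprocessed/"  # hypothetical prefix for raw uploads


def notification_config(function_arn: str) -> Dict[str, Any]:
    """Build a NotificationConfiguration for put_bucket_notification_configuration."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                # Listen to PUT and POST uploads only; s3:ObjectCreated:Copy is left out.
                "Events": ["s3:ObjectCreated:Put", "s3:ObjectCreated:Post"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": UNPROCESSED_PREFIX}
                        ]
                    }
                },
            }
        ]
    }


def should_process(object_key: str) -> bool:
    """Defensive check inside the handler: ignore anything outside the prefix."""
    return object_key.startswith(UNPROCESSED_PREFIX)
```

You would apply it with s3_client.put_bucket_notification_configuration(Bucket=..., NotificationConfiguration=notification_config(arn)). The should_process guard is optional, but it keeps the function from looping if the trigger configuration is ever widened by mistake.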
When setting up an S3 trigger for a Lambda function, it is possible to define which kind of overarching S3 event should be listened to:
My architecture allows files to be put in S3, and the Lambda function runs concurrently for them. However, some other process is overwriting the same files in S3 within milliseconds. Those multiple put events for the same file cause the Lambda to trigger multiple times for the same file.
Is there a threshold I can set on S3 events, something that prevents the Lambda from triggering multiple times for the same file?
Or is there an S3 event type that occurs only when a file is created, not when it is updated?
There is already code in place that checks whether the trigger file is present and creates it if not. But that is of no use, since the other process puts files in S3 very quickly.
Something like this:
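from botocore.exceptions import ClientError

try:
    # Check whether the trigger file already exists
    s3_client.head_object(Bucket=trigger_bucket, Key=trigger_file)
except ClientError as _:
    # head_object raises ClientError if the object cannot be retrieved
    # (for example, because it does not exist yet)
    create_trigger_file(
        s3_client, trigger_bucket, trigger_file
    )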
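You could have the Amazon S3 events delivered to an Amazon SQS FIFO (first-in-first-out) queue, and the queue could then trigger the Lambda function. Note that the S3 documentation lists FIFO queues as unsupported for direct event notifications, so the events would need to reach the queue via Amazon EventBridge or a small forwarder.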
The benefit of using a FIFO queue is that each message has a Message Group ID. A FIFO queue will only provide one message to the AWS Lambda function per Message Group ID at a time. It will not send another message with the same Message Group ID until the earlier one has been fully processed. If you set the Message Group ID to be the Key of the S3 object, you effectively get a separate queue for each object created in S3.
This method would allow Lambda functions to run in parallel for different objects, but for each particular Key there would only be a maximum of one Lambda function executing.
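A sketch of how a message could be keyed per object, assuming some forwarder (hypothetical here) receives the S3 event and calls SQS send_message itself. MessageGroupId is limited to 128 characters while S3 keys can be much longer, so this version hashes the bucket and key; the function name and the use of the event's sequencer value for deduplication are my own choices, not something the answer above specifies:

```python
import hashlib
import json
from typing import Any, Dict


def fifo_message(queue_url: str, bucket: str, key: str, sequencer: str) -> Dict[str, Any]:
    """Build keyword arguments for an SQS send_message call to a FIFO queue."""
    # Same object => same group, so its messages are handled one at a time.
    group_id = hashlib.sha256(f"{bucket}/{key}".encode("utf-8")).hexdigest()
    # The sequencer from the S3 event record distinguishes repeated writes
    # to the same key, so distinct writes are not dropped as duplicates.
    dedup_id = hashlib.sha256(f"{bucket}/{key}/{sequencer}".encode("utf-8")).hexdigest()
    return {
        "QueueUrl": queue_url,
        "MessageBody": json.dumps({"bucket": bucket, "key": key}),
        "MessageGroupId": group_id,
        "MessageDeduplicationId": dedup_id,
    }
```

The forwarder would then call sqs_client.send_message(**fifo_message(...)). If you would rather collapse rapid overwrites of the same key into one message, you could use the group ID as the deduplication ID instead, at the cost of ignoring later writes inside the deduplication window.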
It appears your problem is that multiple invocations of the AWS Lambda function are attempting to access the same files at the same time.
To avoid this, you could set the Lambda function's reserved concurrency to 1 (see "Manage Lambda reserved concurrency" in the AWS Lambda documentation). This allows only a single invocation of the Lambda function to run at any time.
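For illustration, the setting can also be applied from code. This small wrapper assumes you pass in a boto3 Lambda client; the wrapper name is made up, while put_function_concurrency and its two parameters are the regular Lambda API:

```python
from typing import Any


def limit_to_single_invocation(lambda_client: Any, function_name: str) -> Any:
    """Cap the function at one concurrent execution."""
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=1,
    )
```

With boto3 this would be called as limit_to_single_invocation(boto3.client("lambda"), "my-function"), where the function name is a placeholder. Keep in mind that this serializes every invocation, not just those for the same file, so throughput drops accordingly.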
I guess the problem is that your architecture needs to write to the same file. This is not scalable. From the documentation:
Amazon S3 does not support object locking for concurrent writers. If two PUT requests are simultaneously made to the same key, the request with the latest timestamp wins. If this is an issue, you must build an object-locking mechanism into your application.
So, think about your architecture. Why do you have a process that wants to write to the same file multiple times at the same time? Do the Lambdas that create these S3 files need to write to the same file? If I understand your use case correctly, every Lambda could create a unique file, for example based on the name of the PDF you want to create, or with a timestamp added to it. That ensures you don't have write collisions. You could create lifecycle rules on the S3 bucket to delete the files after a day or so, so that your storage costs don't grow too much. Or have a Lambda delete the file when it is finished with it.
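One possible way to generate such unique keys, as a sketch only: combine the base name with a UTC timestamp and a random UUID. The function name and the key layout are assumptions; any scheme that guarantees uniqueness per writer would do:

```python
import uuid
from datetime import datetime, timezone


def unique_object_key(prefix: str, base_name: str, extension: str) -> str:
    """Return a key that will not collide with keys from other writers."""
    # The timestamp keeps keys roughly sortable; the UUID guarantees uniqueness
    # even when two writers produce a key within the same second.
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix}/{base_name}-{timestamp}-{uuid.uuid4().hex}.{extension}"
```

For example, unique_object_key("tmp", "invoice", "pdf") yields something of the form tmp/invoice-<timestamp>-<uuid>.pdf, and a lifecycle rule on the tmp/ prefix can clean the files up later.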
I'm searching for a way to track the identities that modify my table, other than the application service itself. At first I thought there were two options, but:
CloudTrail - the documentation (Logging DynamoDB Operations by Using AWS CloudTrail) says, as far as I understood, I'd be only able to track changes made to the infrastructure itself, but not to the actual use of a table.
DynamoDB Streams - I'd guessed that the modifying identity is also passed in a stream event, but actually it's not. I'm using NEW_AND_OLD_IMAGES as the stream type.
Am I overlooking something, or is there another possibility somewhere else? The stream event does pass me an EventID. Is that of use anywhere?
I'd be grateful for any tips on how to solve this, even if it's a completely different approach.
AWS CloudTrail now supports logging for DynamoDB actions!
AWS CloudTrail Adds Logging of Data Events for Amazon DynamoDB
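As a hedged sketch of how this might be used: the first helper builds an event selector for CloudTrail's put_event_selectors call that records write data events on one table, and the second pulls the caller identity out of already-parsed CloudTrail log records. The helper names are made up, and the record fields used (eventSource, eventName, userIdentity.arn) are the ones CloudTrail normally writes, so check them against your own log files:

```python
from typing import Any, Dict, List, Optional, Tuple


def dynamodb_write_selectors(table_arn: str) -> List[Dict[str, Any]]:
    """Build the EventSelectors argument for CloudTrail put_event_selectors."""
    return [
        {
            "ReadWriteType": "WriteOnly",
            "IncludeManagementEvents": True,
            "DataResources": [
                {"Type": "AWS::DynamoDB::Table", "Values": [table_arn]}
            ],
        }
    ]


def who_modified(records: List[Dict[str, Any]]) -> List[Tuple[str, Optional[str]]]:
    """Return (eventName, caller ARN) pairs for DynamoDB events in CloudTrail records."""
    return [
        (record["eventName"], record.get("userIdentity", {}).get("arn"))
        for record in records
        if record.get("eventSource") == "dynamodb.amazonaws.com"
    ]
```

You would apply the selector with cloudtrail_client.put_event_selectors(TrailName=..., EventSelectors=dynamodb_write_selectors(table_arn)) and then run who_modified over the "Records" array of each delivered log file.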
I have multiple folders inside a bucket. Each folder is named with a unique GUID and will always contain a single file.
I need to fetch only those files that have never been read before. If I fetch all the objects at once and then filter on the client side, it may introduce latency in the near future, since hundreds of new folders could be added every day.
Initially I tried to list objects by specifying StartAfter, but I soon realized it only works with an alphabetically sorted list.
https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjectsV2.html
I am using the AWS C# SDK. Can someone please suggest the best approach?
Thanks
Amazon S3 does not maintain a concept of "objects that have not been accessed".
However, there is a different approach to process each object only once:
Create an Amazon S3 Event that will trigger when an object is created
The Event can then:
Trigger an AWS Lambda function, or
Send a message to an Amazon SQS queue, or
Send a message to an Amazon SNS topic
You could therefore trigger your custom code via one of these methods, and you will never actually need to "search" for new objects.
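To make the Lambda option concrete, here is a minimal handler sketch. It is in Python rather than C#, purely for illustration, and the helper name new_objects is made up. It relies on the standard S3 event layout (Records, s3.bucket.name, s3.object.key) and on the fact that object keys arrive URL-encoded in the event:

```python
from typing import Any, Dict, List, Tuple
from urllib.parse import unquote_plus


def new_objects(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Extract (bucket, key) pairs from an S3 event notification."""
    return [
        (
            record["s3"]["bucket"]["name"],
            # Keys are URL-encoded in the event (spaces arrive as '+').
            unquote_plus(record["s3"]["object"]["key"]),
        )
        for record in event.get("Records", [])
    ]


def lambda_handler(event: Dict[str, Any], context: Any) -> int:
    """Process each newly created object exactly when it is reported."""
    objects = new_objects(event)
    for bucket, key in objects:
        print(f"New object: s3://{bucket}/{key}")
    return len(objects)
```

Because each new folder produces its own event, the code never has to list the bucket or remember which GUIDs it has already seen.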
I need to start a Lambda Function when an object has been created on an S3 Bucket. I found 2 solutions to do this.
Using AWS::S3::Bucket NotificationConfiguration.
Using a CloudWatch AWS::Events::Rule.
They both seem to do exactly the same thing, which is to track specific changes and launch a Lambda function when they happen. I could not find any information on which one should be used. I'm using a CloudFormation template to provision the Lambda, the S3 bucket and the trigger.
Which one should I use to call a Lambda on Object level changes and why?
Use the first one, for these reasons:
A push model is much better than a pull model. Push means you send data when you get it, instead of polling something at a set interval. This is the era of push notifications. You don't go to Facebook every 5 minutes to check whether someone has liked your picture or replied to your comment.
In terms of cost and effort, too, S3 event notifications win.
CloudWatch would be the best option if S3 notifications were not available, but since they are, they are the better choice. Besides, if the service itself offers the feature, why go for an alternative such as CloudWatch rules?