Multiple S3 events together trigger a lambda function - amazon-web-services

I have a scenario where I have two buckets s3-a and s3-b.
When data is put into s3-a it sends out an S3 event.
The same happens with s3-b.
I need to trigger a lambda function when I have the data in both the S3 buckets.
One way I can think of is to use a DynamoDB table as a marker: when a corresponding S3 object is found, set its marker, and then through DynamoDB Streams invoke a Lambda which checks whether both markers are true.

Check the data in both buckets on each trigger. Whichever trigger finds data in both buckets proceeds further. Make it idempotent, so that if both triggers find the data in both buckets, there is no adverse effect.
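A minimal sketch of that check, assuming (hypothetically) that corresponding objects share the same key in both buckets and that "data is present" can be tested with a HeadObject call; the bucket names are placeholders:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

# Hypothetical bucket names; both buckets send ObjectCreated events to this same function.
BUCKETS = ["s3-a", "s3-b"]

def object_exists(bucket, key):
    """Return True if the object is already present in the bucket."""
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] in ("404", "NoSuchKey", "NotFound"):
            return False
        raise

def process_pair(key):
    # Placeholder for the real processing; it must be safe to run twice,
    # because both triggers may see both objects and reach this point.
    print(f"processing {key} from both buckets")

def lambda_handler(event, context):
    key = event["Records"][0]["s3"]["object"]["key"]

    # Only proceed when the counterpart object exists in *both* buckets.
    if not all(object_exists(bucket, key) for bucket in BUCKETS):
        return {"status": "waiting for the other bucket"}

    process_pair(key)
    return {"status": "processed"}
```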

Related

How to ensure that S3 upload triggers a lambda function, but copying data within the same bucket does not trigger the lambda function anymore?

Required procedure:
1. Someone does an upload to an S3 bucket.
2. This triggers a Lambda function that does some processing on the uploaded file(s).
3. Processed objects are then copied into a "processed" folder within the same bucket.
The copy operation in step 3 should never re-trigger the initial Lambda function itself.
I know that the general guidance is to use a different bucket for storing the processed objects in a situation like this (but this is not possible in this case).
So my approach was to set up the S3 trigger to only listen to the PUT/POST methods and exclude the COPY method. The Lambda function itself uses python-boto (S3_CLIENT.copy_object(..)). The approach seems to work (the Lambda function does not seem to be re-triggered by the copy operation).
However, I wanted to ask: is this approach really reliable?
You can filter which events trigger the S3 notification.
In general, there are two ways to trigger Lambda from an S3 event: bucket notifications and EventBridge.
Notifications: https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-how-to-filtering.html
EB: https://aws.amazon.com/blogs/aws/new-use-amazon-s3-event-notifications-with-amazon-eventbridge/
In your case, a quick search doesn't show that you can set up a "negative" rule, i.e. "everything which doesn't have the processed prefix". But you can rework your bucket structure a bit, dump unprocessed items under an unprocessed/ prefix, and set up the filter based on that prefix only.
When setting up an S3 trigger for a Lambda function, you also have the option to define which specific kinds of S3 events should be listened to (for example, only PUT and POST, but not COPY).
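For example, a bucket notification configuration that only listens to Put and Post events (so the later copy_object call does not re-trigger the function) might look roughly like the sketch below; the bucket name, Lambda ARN, and prefix are placeholders, and note this call replaces the bucket's entire existing notification configuration:

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket name and function ARN; adjust to your setup.
s3.put_bucket_notification_configuration(
    Bucket="my-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "Id": "process-uploads",
                "LambdaFunctionArn": "arn:aws:lambda:eu-west-1:123456789012:function:process-upload",
                # Only direct uploads; s3:ObjectCreated:Copy is deliberately
                # omitted so the in-bucket copy to "processed/" won't re-trigger.
                "Events": ["s3:ObjectCreated:Put", "s3:ObjectCreated:Post"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "unprocessed/"}
                        ]
                    }
                },
            }
        ]
    },
)
```

The Lambda function also needs a resource policy (lambda add-permission) allowing S3 to invoke it, which the Console trigger wizard normally adds for you.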

Processing S3 events in order with Lambda

I am setting up an S3 bucket. In this S3 bucket, data is going to be written by an external process.
I am setting up an AWS Lambda that would be triggered when an object in S3 gets created/updated and would process and store the data in RDS.
My question is as follows:
If objects get written too fast to S3, there is a possibility for multiple Lambda invocations to be triggered simultaneously. So, in this case, is there any chance for the objects to be processed not in the order they were written to the S3 bucket?
If the answer to the above question is yes, then with Lambda I would have to push the payload to a FIFO SQS queue and set up a listener that processes the payload and finally stores the data into RDS.
Sadly, they are not guaranteed to be in order. From the docs:
Event notifications are not guaranteed to arrive in the order that the events occurred. However, notifications from events that create objects (PUTs) and delete objects contain a sequencer, which can be used to determine the order of events for a given object key.
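If strict per-key ordering matters, one option besides FIFO SQS is to use that sequencer yourself: the docs describe it as a hexadecimal value that is larger for later events on the same key, with the guidance to right-pad the shorter value with zeros before comparing. A rough sketch, assuming a hypothetical DynamoDB table last_event that stores the last sequencer seen per key:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical table with partition key "object_key" holding the last sequencer processed.
table = dynamodb.Table("last_event")

def is_newer(seen, incoming):
    """Compare two S3 sequencer strings for the same key.
    Right-pad the shorter hex string with zeros, then compare lexicographically;
    the greater value is the later event."""
    if seen is None:
        return True
    width = max(len(seen), len(incoming))
    return incoming.ljust(width, "0") > seen.ljust(width, "0")

def lambda_handler(event, context):
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        sequencer = record["s3"]["object"]["sequencer"]

        item = table.get_item(Key={"object_key": key}).get("Item")
        seen = item["sequencer"] if item else None

        if not is_newer(seen, sequencer):
            # Stale or duplicate notification for this key; skip it.
            continue

        table.put_item(Item={"object_key": key, "sequencer": sequencer})
        # ... process the object and write to RDS here ...
```

Note the simple get-then-put leaves a small race window under heavy concurrency; a conditional write on the sequencer value would tighten it.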

AWS lambda function and Athena to create partitioned table

Here are my requirements. Every day I'm receiving a CSV file into an S3 bucket. I need to partition that data and store it as Parquet to eventually map a table. I was thinking about using an AWS Lambda function that is triggered whenever a file is uploaded. I'm not sure what the steps are to do that.
There are (as usual in AWS!) several ways to do this; the first two that come to mind are:
using a CloudWatch Event, with an S3 PutObject (object-level) action as the trigger, and a Lambda function that you have already created as the target.
starting from the Lambda function, it is slightly easier to add suffix-filtered triggers, e.g. for any .csv file, by going to the function configuration in the Console and, in the Designer section, adding a trigger: choose S3 and the options you want to use, e.g. bucket, event type, prefix, suffix.
In either case, you will need to write the Lambda function to do the work you have described, and it will need IAM access to the bucket to pull the files and process them.
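A rough sketch of such a function, assuming the awswrangler (AWS SDK for pandas) layer is attached to the Lambda; the output path, Glue database/table names, and the "date" partition column are placeholders:

```python
import urllib.parse

import awswrangler as wr

# Placeholder output location and Glue catalog names.
OUTPUT_PATH = "s3://my-parquet-bucket/daily/"
DATABASE = "my_database"
TABLE = "daily_data"

def lambda_handler(event, context):
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])

    # Read the uploaded CSV straight from S3.
    df = wr.s3.read_csv(f"s3://{bucket}/{key}")

    # Write it back as partitioned Parquet (here by a hypothetical "date" column)
    # and register/update the table in the Glue catalog so Athena can query it.
    wr.s3.to_parquet(
        df=df,
        path=OUTPUT_PATH,
        dataset=True,
        partition_cols=["date"],
        database=DATABASE,
        table=TABLE,
        mode="append",
    )
```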

Copy data from S3 and post process

There is a service that generates data in an S3 bucket that is used for warehouse querying. Data is inserted into S3 through a daily mechanism.
I am interested in copying that data from S3 to my service account to further classify the data. The classification needs to happen in my AWS account because it is based on information present there and is specific to my team/service. The service generating the data in S3 is neither concerned with the classification nor has the data to make the classification decision.
Each S3 file consists of JSON objects (records). For every record, I need to look into a DynamoDB table. Based on whether the data exists in the DynamoDB table, I need to add an additional attribute to the JSON object and store the resulting records in another S3 bucket in my account.
The way I am considering doing this:
Trigger a scheduled CloudWatch event periodically to invoke a Lambda that will copy the files from the source S3 bucket into a bucket (let's say Bucket A) in my account.
Then, use another scheduled CloudWatch event to invoke a Lambda to read the records in the JSON, compare them with the DynamoDB table to determine the classification, and write the updated records to another bucket (let's say Bucket B).
I have a few questions regarding this:
Are there better alternatives for achieving this?
Would using aws s3 sync in the first Lambda be a good way to achieve this? My concerns revolve around the Lambdas timing out due to the large amount of data, especially the second Lambda that needs to compare against DynamoDB for every record.
Rather than setting up scheduled events, you can trigger the AWS Lambda functions in real-time.
Use Amazon S3 Events to trigger the Lambda function as soon as a file is created in the source bucket. The Lambda function can call CopyObject() to copy the object to Bucket-A for processing.
Similarly, an event on Bucket-A could then trigger another Lambda function to process the file; a rough sketch of both functions follows the notes below. Some things to note:
Lambda functions run for a maximum of 15 minutes
You can increase the memory assigned to a Lambda function, which will also increase the amount of CPU assigned. So, this might speed up the function if it is currently taking longer than 15 minutes.
By default, a Lambda function gets 512 MB of ephemeral storage space (/tmp).
If the data is too big, or takes too long to process, then you will need to find a way to do it outside of AWS Lambda. For example, using Amazon EC2 instances.
If you can export the data from DynamoDB (perhaps on a regular basis), you might be able to use Amazon Athena to do all the processing, but that depends on what you're trying to do. If it is simple SELECT/JOIN queries, it might be suitable.
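A minimal sketch of both event-driven functions under those constraints; the bucket names, the DynamoDB table and its key, and the JSON layout are assumptions:

```python
import json

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")
# Hypothetical table keyed by "record_id", used for the classification lookup.
lookup_table = dynamodb.Table("classification-lookup")

BUCKET_A = "my-bucket-a"   # landing copy in my account
BUCKET_B = "my-bucket-b"   # classified output

def copy_handler(event, context):
    """Triggered by the source bucket's ObjectCreated event; copies the object into Bucket-A."""
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        s3.copy_object(
            Bucket=BUCKET_A,
            Key=key,
            CopySource={"Bucket": src_bucket, "Key": key},
        )

def classify_handler(event, context):
    """Triggered by Bucket-A's ObjectCreated event; enriches each record and writes to Bucket-B."""
    for record in event["Records"]:
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=BUCKET_A, Key=key)["Body"].read()
        records = json.loads(body)  # assuming each file is a JSON array of records

        for item in records:
            found = "Item" in lookup_table.get_item(Key={"record_id": item["id"]})
            item["classified"] = found  # hypothetical attribute name

        s3.put_object(Bucket=BUCKET_B, Key=key, Body=json.dumps(records))
```

Since the source bucket is in another account, its bucket policy must also grant the copy function's IAM role s3:GetObject on those objects.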

Let a resource trigger lambda only once

I have an event-driven data pipeline on AWS which processes millions of files. Each file in my S3 bucket triggers a Lambda. The Lambda processes the data in the file and dumps the processed data to an S3 bucket, which in turn triggers another Lambda, and so on.
Downstream in my pipeline I have a Lambda which creates an Athena database and table. This Lambda is triggered as soon as an object is dumped under the appropriate key of my S3 bucket. It's enough to call this Lambda, which creates my Athena database and table, only once.
How can I avoid having my Lambda triggered multiple times?
This is your existing flow:
1. S3 triggers a Lambda once a new file arrives (event driven)
2. The "Lambda to process the file" runs and then delivers to another S3 bucket
3. The other S3 bucket also triggers another Lambda
Your step 3 is not event driven; you are forcing an event.
I suggest the following flow:
1. S3 triggers a Lambda once a new file arrives (event driven)
2. The "Lambda to process the file" runs and then delivers to another S3 bucket
Only two steps: the Lambda that processes the file should use the Athena SDK to check whether the desired table already exists, and only if it doesn't, call the Lambda that creates the Athena table. The delivery S3 bucket should not trigger the Athena Lambda.
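A hedged sketch of that existence check, here using the Glue Data Catalog (where Athena tables are registered) rather than the Athena query API; the database, table, and downstream function names are placeholders:

```python
import boto3

glue = boto3.client("glue")
lambda_client = boto3.client("lambda")

DATABASE = "my_database"                  # placeholder
TABLE = "my_table"                        # placeholder
CREATE_FUNCTION = "create-athena-table"   # placeholder downstream Lambda

def table_exists(database, table):
    """Return True if the table is already registered in the Glue catalog."""
    try:
        glue.get_table(DatabaseName=database, Name=table)
        return True
    except glue.exceptions.EntityNotFoundException:
        return False

def lambda_handler(event, context):
    # ... process the file and deliver the output to the second bucket ...

    # Only invoke the table-creation Lambda if the table isn't there yet.
    if not table_exists(DATABASE, TABLE):
        lambda_client.invoke(
            FunctionName=CREATE_FUNCTION,
            InvocationType="Event",  # asynchronous, fire-and-forget
        )
```

With millions of concurrent invocations there is still a small race window, but a duplicate call is usually harmless if the table-creation Lambda uses CREATE TABLE IF NOT EXISTS.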