How to handle multiple fan-outs using AWS Lambdas

I have one AWS Lambda that kicks off (via SNS events) multiple Lambdas, which in turn kick off (via SNS events) multiple Lambdas. All of these Lambdas write files to S3, and I need to know when all files have been written. Another Lambda will then send a final SNS message containing references to all the files produced. The amount of fan-out in the second set of Lambdas is unknown, as it depends on the first fan-out.
If this were a single fan-out I would know how many files to look for, but as it is a two-step fan-out I am unsure how to monitor for all files. Has anybody dealt with this before? Thanks.

I would create a DynamoDB table for tracking this process. Create a single record in the table when the initial Lambda function kicks off, keyed by a unique ID (a UUID, for example) if you don't already have a unique ID for this process. Also add that unique ID to the SNS messages; it will be the key used for all updates performed by the other processes. When the first process creates the record, it should also add a splitters_invoked property holding the number of second-level splitter functions it is invoking, and a splitters_complete property set to 0.
Inside the second-level splitter functions you can use the DynamoDB Conditional Updates feature to update the DynamoDB record with the list of files created and their S3 locations. The second-level splitter functions will also use the DynamoDB Atomic Counters feature to increment the splitters_complete count just before they exit.
At the "process" level, each of those invocations will perform another Conditional Update to the DynamoDB record, flagging the individual file it just processed as complete.
Finally, configure DynamoDB Streams to trigger another Lambda function. This Lambda function will check two conditions: splitters_complete is equal to splitters_invoked, and all files in the file list are marked as "completed". Once both hold, it knows that it can perform the final step in your process.
Alternatively, if you don't want to keep the list of S3 file locations in the DynamoDB table, simply use atomic counters for that as well: one counter for the total number of files created by the second-level splitters, and another counter for the files completed by the file processing functions.
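As a rough sketch of the check the stream-triggered function would perform, assuming the tracking record has already been deserialized into a plain Python dict. The splitters_invoked and splitters_complete names come from the answer above; the "files" attribute and its "completed" status value are hypothetical names chosen for illustration.

```python
from typing import Any, Dict


def fanout_complete(record: Dict[str, Any]) -> bool:
    """Return True once every splitter has exited and every file is done.

    `record` is assumed to look like:
    {"splitters_invoked": 2, "splitters_complete": 2,
     "files": {"s3://bucket/a.csv": "completed"}}
    The "files" map and the "completed" value are hypothetical.
    """
    invoked = record.get("splitters_invoked", 0)
    complete = record.get("splitters_complete", 0)
    # Until every splitter has exited, the file list may still grow.
    if invoked == 0 or complete != invoked:
        return False
    files = record.get("files", {})
    return all(status == "completed" for status in files.values())
```

The splitter count is checked first because the file list is only known to be final once all splitters have reported in.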

Related

Configure multiple delete event in S3/Lambda

I am trying to build a Lambda function that gets triggered on S3 delete events. If multiple items are deleted at once, I want to use an S3 batch job. What I can't figure out or find in the documentation is what an event like that would look like. I'd assume it would just have multiple similar items in Records and I could iterate through, get all the keys, and then batch delete, but I can't confirm that. I've searched the documentation, and I built a test Lambda that would just log the event, but that came through as multiple distinct events. I'm stumped as to how to do what I'm trying here.
The S3 event you need to subscribe to is s3:ObjectRemoved:Delete, which according to the documentation is used to track an object or a batch of objects being removed:
By using the ObjectRemoved event types, you can enable notification when an object or a batch of objects is removed from a bucket.
You can expect an event structured as detailed here.
However, since you said in the comment that you just wanted to "copy the objects pre-delete to another bucket", you may want to explore S3 bucket versioning.
Enabling versioning will allow you to preserve the objects in a "deleted" state, leaving room for future restores, as per delete workflow here.
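A minimal sketch of a handler helper that works whether a delete arrives as one record per event (as the test Lambda in the question observed) or as several records in one event. It assumes the standard S3 notification layout (Records, eventName, s3.bucket.name, s3.object.key); the function name is made up for illustration.

```python
from typing import Any, Dict, List, Tuple
from urllib.parse import unquote_plus


def removed_objects(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Collect (bucket, key) pairs for every ObjectRemoved record in an event."""
    pairs: List[Tuple[str, str]] = []
    for rec in event.get("Records", []):
        # Inside the event payload the name has no "s3:" prefix.
        if not rec.get("eventName", "").startswith("ObjectRemoved:"):
            continue
        s3 = rec["s3"]
        # Object keys in S3 notifications are URL-encoded, with "+" for spaces.
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs
```

Iterating over Records rather than reading Records[0] keeps the handler correct regardless of how many records S3 puts into a single invocation.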

Automated Real Time Data Processing on AWS with Lambda

I am interested in doing automated real-time data processing on AWS using Lambda and I am not certain about how I can trigger my Lambda function. My data processing code involves taking multiple files and concatenating them into a single data frame after performing calculations on each file. Since files are uploaded simultaneously onto S3 and files are dependent on each other, I would like the Lambda to be only triggered when all files are uploaded.
Current Approaches/Attempts:
- I am considering an S3 trigger, but my concern is that it will result in an error when a single file upload triggers the Lambda before the other files have arrived. An alternative would be adding a wait time, but I would rather avoid that in order to limit the compute resources used.
- A scheduled trigger using CloudWatch/EventBridge, but this would not be real-time processing.
- An SNS trigger, but I am not certain whether the message can be automated without knowing when the file uploads are complete.
Any suggestion is appreciated! Thank you!
If you really cannot do it with a scheduled function, the best option is to trigger a Lambda function when an object is created.
The tricky bit is that it will fire your function on each object upload. So either you identify the "last part", e.g. based on some metadata, or you store and track the state of all uploads, e.g. in DynamoDB, and do the actual processing only when a batch is complete.
Best, Stefan
Your file, arriving in parts, might be named like this:
filename_part1.ext
filename_part2.ext
If one of your own systems generates those files, have that system write a final empty marker file named like this:
filename.final
Since an S3 event trigger can be filtered by key suffix, use the .final extension to invoke the Lambda and process the records.
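The suffix filter described above could be expressed roughly as follows. This sketch only builds the notification configuration dict in the shape accepted by boto3's put_bucket_notification_configuration; the function name and the ARN in the usage note are placeholders, not values from the original post.

```python
from typing import Any, Dict


def final_marker_notification(lambda_arn: str, suffix: str = ".final") -> Dict[str, Any]:
    """Build an S3 notification config that fires only for keys ending in `suffix`."""
    return {
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": lambda_arn,
                "Events": ["s3:ObjectCreated:*"],
                # Only objects whose key ends with the suffix invoke the function.
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": suffix}]}
                },
            }
        ]
    }
```

With a boto3 S3 client, the result would be passed as NotificationConfiguration together with the bucket name, for example using a placeholder ARN such as arn:aws:lambda:us-east-1:123456789012:function:process-parts.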
In an alternative approach, if you do not have access to the server putting objects into your S3 bucket, then invoke the Lambda on each PUT operation in the bucket and insert an entry in DynamoDB.
You need to put one unique entry per file (not per file part) in DynamoDB, with:
filename and last_part_recieved_time
The last_part_recieved_time keeps getting updated for as long as file parts keep arriving.
This table can then be checked by a cron-scheduled Lambda invocation, which tests whether the time skew (the difference between the SYSTIME of the Lambda invocation and the DynamoDB entry's last_part_recieved_time) is large enough to process the records.
I would still prefer the first approach, as the second one still leaves room for error.
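The time-skew test in the second approach might look roughly like this. The five-minute quiet period is an arbitrary assumption for illustration; the right value depends on how far apart the parts of one file can arrive, which is also why this approach can still misfire.

```python
from datetime import datetime, timedelta


def is_quiet_long_enough(
    last_part_recieved_time: datetime,
    now: datetime,
    quiet_period: timedelta = timedelta(minutes=5),  # assumed threshold
) -> bool:
    """Return True if no new part has arrived for at least `quiet_period`."""
    return now - last_part_recieved_time >= quiet_period
```

The scheduled Lambda would call this for each entry in the table, passing its own invocation time as `now`, and process only the files for which it returns True.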
Since you want this to be as real time as possible, perhaps you could just perform your logic every single time a file is uploaded, updating the version of the output as new files are added, and iterating through an S3 prefix per grouping of files, like in this other SO answer.
In terms of the architecture, you could add in an SQS queue or two to make this more resilient. An S3 Put Event can trigger an SQS message, which can trigger a Lambda function, and you can have error handling logic in the Lambda function that puts that event in a secondary queue with a visibility timeout (sort of like a backoff strategy) or back in the same queue for retries.

How do you run functions in parallel?

I want to retrieve x number of records from a database based on some custom select statement; the output will be an array of JSON data. I then want to pass each element in the array to another Lambda function in parallel.
So if 1000 records are returned, 1000 Lambda functions need to be executed in parallel (I will increase my account limit to what I need). If 30 out of 1000 fail, the main task that was retrieving the records needs to know about it.
I'm struggling to put together this simple flow.
I currently use javascript and AWS Aurora. I'm not looking for node.js/javascript code that retrieves the data, just the AWS Step Functions configuration and how to build an array within each function.
Thank you.
"if 1000 records are returned, 1000 lambda functions need to be executed in parallel"
What you are trying to achieve is not supported by a Step Functions Parallel state. A State Machine task cannot be modified based on the input it received, so a Parallel state cannot be configured to add or remove functions based on the number of items in an array input. (Step Functions also offers a Map state, which runs the same steps once per element of an input array and may fit this case.)
You should probably consider using an SQS Lambda trigger. The records retrieved from the DB can be added to an SQS queue, which will then trigger a Lambda function for the items received.
"If 30 out of 1000 fail, the main task that was retrieving the records needs to know about it."
There are various ways to achieve this. SQS won't delete an item from the queue if Lambda returns an error. You can configure a DLQ and RedrivePolicy based on your requirements. Or you may want to come up with a custom solution that keeps a count of failing Lambdas and notifies the service that fetches the records from the DB.
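A sketch of both halves of the SQS suggestion, under stated assumptions. The first function splits records into entries shaped for SendMessageBatch, which accepts at most 10 messages per call. The second is a handler helper that reports only the failed messages back to SQS; that relies on ReportBatchItemFailures being enabled on the event source mapping, which is an assumption not mentioned in the answer. The function names and the `process` callback are hypothetical.

```python
import json
from typing import Any, Callable, Dict, Iterator, List


def to_batches(records: List[Dict[str, Any]], size: int = 10) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of SendMessageBatch entries, at most `size` per list."""
    for start in range(0, len(records), size):
        chunk = records[start:start + size]
        # "Id" only has to be unique within one batch request.
        yield [
            {"Id": str(start + i), "MessageBody": json.dumps(rec)}
            for i, rec in enumerate(chunk)
        ]


def handle_batch(event: Dict[str, Any], process: Callable[[Dict[str, Any]], None]) -> Dict[str, Any]:
    """Process each SQS message; return the IDs of the ones that failed."""
    failures = []
    for message in event.get("Records", []):
        try:
            process(json.loads(message["body"]))
        except Exception:
            # Failed messages stay on the queue and are retried or sent to the DLQ.
            failures.append({"itemIdentifier": message["messageId"]})
    return {"batchItemFailures": failures}
```

Messages that keep failing end up in the DLQ, so the DLQ depth gives the main task a count of failed records.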

How to build an index of S3 objects when data exceeds object metadata limit?

Building an index of S3 objects can be very useful for making them quickly searchable. The most obvious way is to store additional data in the object metadata and use a Lambda to write it to DynamoDB or RDS, as described here: https://aws.amazon.com/blogs/big-data/building-and-maintaining-an-amazon-s3-metadata-index-without-servers/
However, this strategy is limited by the amount of data one can store in the object metadata, which is 2 KB, as described here: https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html. Suppose you need to build a system where, every time an object is uploaded to S3, you need to add the object name and some information not contained in the file to a database, and this data exceeds 2 KB: you can't store it in the object metadata.
What are viable strategies to keep the bucket and the index updated?
Implement two chained API calls where each call is idempotent: if the second fails when the first succeeds, one can retry until success. What happens if you perform a PUT of an identical object on S3 and you have versioning activated? Will S3 increase the version? In that case, implementing idempotency requires that only a single writer be active at any time.
Use some sort of workflow engine to keep track of this two-step behaviour, such as AWS Step Functions. What are the gotchas with this solution?

AWS - want to upload multiple files to S3 and only when all are uploaded trigger a lambda function

I am seeking advice on what's the best way to design this -
Use Case
I want to put multiple files into S3. Once all files are successfully saved, I want to trigger a lambda function to do some other work.
Naive Approach
The way I am approaching this is by saving a record in Dynamo that contains a unique identifier and the total number of records I will be uploading along with the keys that should exist in S3.
A basic implementation would be to take my existing Lambda function, which is invoked any time my S3 bucket is written to, and have it check manually whether all the other files have been saved.
The Lambda function would know (look in Dynamo to determine what we're looking for) and query S3 to see if the other files are in. If so, use SNS to trigger my other lambda that will do the other work.
Edit: Another approach is have my client program that puts the files in S3 be responsible for directly invoking the other lambda function, since technically it knows when all the files have been uploaded. The issue with this approach is that I do not want this to be the responsibility of the client program... I want the client program to not care. As soon as it has uploaded the files, it should be able to just exit out.
Thoughts
I don't think this is a good idea. Mainly because Lambda functions should be lightweight, and polling the database from within the Lambda function to get the S3 keys of all the uploaded files and then checking in S3 if they are there - doing this each time seems ghetto and very repetitive.
What's the better approach? I was thinking something like using SWF but am not sure if that's overkill for my solution or if it will even let me do what I want. The documentation doesn't show real "examples" either. It's just a discussion without much of a step by step guide (perhaps I'm looking in the wrong spot).
Edit: In response to mbaird's suggestions below:
Option 1 (SNS) This is what I will go with. It's simple and doesn't really violate the Single Responsibility Principle. That is, the client uploads the files and sends a notification (via SNS) that its work is done.
Option 2 (Dynamo streams) So this is essentially another "implementation" of Option 1. The client makes a service call, which in this case, results in a table update vs. a SNS notification (Option 1). This update would trigger the Lambda function, as opposed to notification. Not a bad solution, but I prefer using SNS for communication rather than relying on a database's capability (in this case Dynamo streams) to call a Lambda function.
In any case, I'm using AWS technologies and have coupling with their offering (Lambda functions, SNS, etc.) but I feel relying on something like Dynamo streams is making it an even tighter coupling. Not really a huge concern for my use case but still feels dirty ;D
Option 3 with S3 triggers My concern here is the possibility of race conditions. For example, if multiple files are being uploaded by the client simultaneously (think of several async uploads fired off at once with varying file sizes), what if two files happen to finish uploading at around the same time, and two or more Lambda functions (or whatever implementations we use) query Dynamo and both get back N as the number of completed uploads (instead of N and N+1)? Now even though the final result should be N+2, each one would write N+1. Nooooooooooo!
So Option 1 wins.
If you don't want the client program responsible for invoking the Lambda function directly, then would it be OK if it did something a bit more generic?
Option 1: (SNS) What if it simply notified an SNS topic that it had completed a batch of S3 uploads? You could subscribe your Lambda function to that SNS topic.
Option 2: (DynamoDB Streams) What if it simply updated the DynamoDB record with something like an attribute record.allFilesUploaded = true. You could have your Lambda function trigger off the DynamoDB stream. Since you are already creating a DynamoDB record via the client, this seems like a very simple way to mark the batch of uploads as complete without having to code in knowledge about what needs to happen next. The Lambda function could then check the "allFilesUploaded" attribute instead of having to go to S3 for a file listing every time it is called.
Alternatively, don't insert the DynamoDB record until all files have finished uploading, then your Lambda function could just trigger off new records being created.
Option 3: (continuing to use S3 triggers) If the client program can't be changed from how it works today, then instead of listing all the S3 files and comparing them to the list in DynamoDB each time a new file appears, simply update the DynamoDB record via an atomic counter. Then compare the resulting value against the size of the file list. Once the values are the same, you know all the files have been uploaded. The downside is that you need to provision enough capacity on your DynamoDB table to handle all the updates, which is going to increase your costs.
Also, I agree with you that SWF is overkill for this task.
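A sketch of the atomic counter from Option 3, which also addresses the race condition raised in the question: the increment happens inside DynamoDB, so two concurrent invocations never both read N and write N+1. The attribute names batch_id, uploaded_count and expected_count are hypothetical, and `client` is assumed to be a boto3 low-level DynamoDB client (or anything else exposing a compatible update_item).

```python
from typing import Any, Dict


def increment_params(table: str, batch_id: str) -> Dict[str, Any]:
    """Build update_item arguments that atomically add 1 to uploaded_count."""
    return {
        "TableName": table,
        "Key": {"batch_id": {"S": batch_id}},  # hypothetical key attribute
        "UpdateExpression": "ADD uploaded_count :one",
        "ExpressionAttributeValues": {":one": {"N": "1"}},
        # Return the whole item after the update so one call gives both counts.
        "ReturnValues": "ALL_NEW",
    }


def record_upload(client: Any, table: str, batch_id: str) -> bool:
    """Count one finished upload; return True for the upload that completes the batch."""
    response = client.update_item(**increment_params(table, batch_id))
    attrs = response["Attributes"]
    uploaded = int(attrs["uploaded_count"]["N"])
    expected = int(attrs["expected_count"]["N"])  # written when the record was created
    return uploaded == expected
```

Because each invocation gets back a distinct post-increment value, only one of them sees the count reach expected_count, and that one can trigger the follow-up work.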