Lambda times out and can't process all results from DynamoDB. Any tips to optimize this other than moving to Fargate?

The Lambda needs to get all results from DynamoDB, perform some processing on each record, and trigger a Step Functions workflow. Although DynamoDB returns paginated results, the Lambda will time out if there are too many pages to process within the 15-minute Lambda limit. Is there any workaround that keeps using Lambda, other than moving to Fargate?
Overview of Lambda
while True:
    l, nextToken = get list of records from DynamoDB
    for each record in l:
        perform some preprocessing like reading a file and triggering a workflow
    if nextToken == None:
        break

I assume processing one record can fit inside the 15-minute lambda limit.
What you can do is make your original Lambda an orchestrator that calls a worker Lambda to process each record.
Orchestrator Lambda
while True:
    l, nextToken = get list of records from DynamoDB
    for each record in l:
        call the worker lambda by passing the record as the event
    if nextToken == None:
        break
Worker Lambda
perform some preprocessing like reading a file and triggering a workflow
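A minimal boto3 sketch of this orchestrator/worker split (the table name and worker function name are made up, and the worker body is elided); the key detail is `InvocationType="Event"`, which invokes the worker asynchronously so the orchestrator only pays for the time it takes to hand each record off:

```python
import json

def paginate(fetch_page):
    """Yield every record across all pages; fetch_page(start_key) must
    return a dict shaped like a DynamoDB scan/query response."""
    start_key = None
    while True:
        page = fetch_page(start_key)
        yield from page["Items"]
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:
            break

def orchestrator_handler(event, context):
    """Hand every record off to a worker Lambda instead of processing inline."""
    import boto3
    dynamodb = boto3.client("dynamodb")
    lambda_client = boto3.client("lambda")

    def fetch_page(start_key):
        kwargs = {"TableName": "my-table"}  # hypothetical table name
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        return dynamodb.scan(**kwargs)

    for record in paginate(fetch_page):
        lambda_client.invoke(
            FunctionName="worker-lambda",  # hypothetical worker function
            InvocationType="Event",        # async: returns as soon as handed off
            Payload=json.dumps(record),
        )
```

Note the orchestrator itself still pages through the whole table, so this only helps if the per-record processing (not the paging) is what consumes the 15 minutes.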

You can use SQS to process these in rapid succession. You can even use it to perform them more or less in parallel rather than in sequence.
Lambda reads DynamoDB -> breaks each entry into a JSON object -> sends the JSON object to SQS -> which queues them out to multiple invoked Lambdas -> each of those Lambdas is designed to handle one single entry and finish.
Doing this allows you to split up long tasks that may take many hours across multiple Lambda invocations, by designing the second Lambda to handle only one iteration of the task and using SQS as your loop/iterator. You can configure SQS to send as fast as possible or one at a time (though if you send one at a time, you will have to manage the time-to-live and staleness settings of the messages in the queue).
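The fan-out step above might be sketched as follows; the queue URL is hypothetical, while the 10-message cap per SendMessageBatch call is a real SQS constraint:

```python
import json

def chunk(items, size=10):
    """SQS SendMessageBatch accepts at most 10 messages per call."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def enqueue(records, queue_url):
    """Fan the DynamoDB records out to SQS as individual JSON messages,
    one message per record, so each worker Lambda handles a single entry."""
    import boto3
    sqs = boto3.client("sqs")
    for batch in chunk(records):
        sqs.send_message_batch(
            QueueUrl=queue_url,
            Entries=[{"Id": str(i), "MessageBody": json.dumps(r)}
                     for i, r in enumerate(batch)],
        )
```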
In addition, if this is a regular thing where new items get added to the table and then have to be processed, you should make use of DynamoDB Streams: every time a new item is added, a Lambda is triggered for that new item, allowing you to run your workflow in real time as items are added.

Related

Automated Real Time Data Processing on AWS with Lambda

I am interested in doing automated real-time data processing on AWS using Lambda and I am not certain about how I can trigger my Lambda function. My data processing code involves taking multiple files and concatenating them into a single data frame after performing calculations on each file. Since files are uploaded simultaneously onto S3 and files are dependent on each other, I would like the Lambda to be only triggered when all files are uploaded.
Current Approaches/Attempts:
- I am considering an S3 trigger, but my concern is that the Lambda would start as soon as a single file is uploaded, resulting in an error. An alternate option would be adding a wait time, but that is not preferred, to limit the computation resources used.
- A scheduled trigger using CloudWatch/EventBridge, but this would not be real-time processing.
- An SNS trigger, but I am not certain the message can be automated without knowing when the file uploads are complete.
Any suggestion is appreciated! Thank you!
If you really cannot do it with a scheduled function, the best option is to trigger a Lambda function when an object is created.
The tricky bit is that it will fire your function on each object upload. So you can either identify the "last part", e.g. based on some metadata, or you will need to store and track the state of all uploads, e.g. in a DynamoDB table, and do the actual processing only when a batch is complete.
Best, Stefan
Your files coming in parts might be named like:
filename_part1.ext
filename_part2.ext
If any of your systems is generating those files, then use that system to also generate a final dummy blank file named:
filename.final
Since an S3 event trigger lets you filter on a key suffix, use the .final extension to invoke the Lambda and process the records.
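A rough handler for this, assuming the suffix filter on the S3 event notification is set to .final; the bucket layout and the processing step itself are hypothetical:

```python
def base_name(key):
    """Strip the trailing extension: 'filename.final' -> 'filename'."""
    return key.rsplit(".", 1)[0]

def handler(event, context):
    """Triggered only for keys ending in '.final' (via the suffix filter on
    the S3 event notification), so all real parts are already in the bucket."""
    import boto3
    s3 = boto3.client("s3")
    for rec in event["Records"]:
        bucket = rec["s3"]["bucket"]["name"]
        prefix = base_name(rec["s3"]["object"]["key"])
        # list all the real parts that share the marker's base name
        parts = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
        keys = [o["Key"] for o in parts.get("Contents", [])
                if not o["Key"].endswith(".final")]
        # hypothetical: concatenate the parts / run the calculations on `keys`
```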
In an alternative approach, if you do not have access to the server putting objects into your S3 bucket, then with each PUT operation in the bucket, invoke the Lambda and insert an entry in DynamoDB.
You need to put a unique entry per file (not per file part) in Dynamo with:
filename and last_part_received_time
The last_part_received_time keeps getting updated for as long as file parts keep arriving.
Now, this table can be looked up by a cron Lambda invocation that checks whether the time skew (the difference between the SYSTIME of the Lambda invocation and the entry's last_part_received_time) is large enough to safely process the records.
I would still prefer the first approach, as the second one still leaves some room for error.
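The cron check in the second approach might look roughly like this; the table name, attribute name, and 5-minute staleness threshold are all assumptions:

```python
import time

STALENESS_SECONDS = 300  # hypothetical threshold: no new part for 5 minutes

def is_batch_complete(last_part_received_time, now=None, threshold=STALENESS_SECONDS):
    """True once enough time has passed since the last part arrived."""
    now = time.time() if now is None else now
    return (now - last_part_received_time) >= threshold

def cron_handler(event, context):
    """Scheduled Lambda: scan the tracking table and process any file whose
    parts have stopped arriving."""
    import boto3
    table = boto3.resource("dynamodb").Table("upload-tracker")  # hypothetical table
    for item in table.scan()["Items"]:
        if is_batch_complete(float(item["last_part_received_time"])):
            pass  # process the file's parts, then delete the tracking row
```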
Since you want this to be as real time as possible, perhaps you could just perform your logic every single time a file is uploaded, updating the version of the output as new files are added, and iterating through an S3 prefix per grouping of files, like in this other SO answer.
In terms of the architecture, you could add in an SQS queue or two to make this more resilient. An S3 Put Event can trigger an SQS message, which can trigger a Lambda function, and you can have error handling logic in the Lambda function that puts that event in a secondary queue with a visibility timeout (sort of like a backoff strategy) or back in the same queue for retries.

AWS Lambda - trigger synchronously repeatedly until a condition has been met

I have a use case where I want a scheduled Lambda to read from a DynamoDB table until there are no records left to process from its query. I don't want to run lots of instances of the Lambda, as each one hits a REST endpoint and I don't want to overload this external service.
The reason I am thinking I can't use DynamoDB Streams (please correct me if I am wrong here) is:
this DDB table is where messages will be sent when a legacy service is down; the scheduled error-handler Lambda that reads them should not try to process them as soon as they are inserted, since the legacy service is likely still down. (Is it possible with streams to update one row in the DB, say legacy_service = alive, and then trigger a Lambda ONLY for the rows where processed_status = false?)
I also don't want multiple instances of the Lambda running at one time, as I don't want to throttle the legacy service.
I would like a scheduled Lambda that queries the DynamoDB table for all records that have processed_status = false. The query has a limit so it only retrieves a small batch (1 or 2 messages) and processes them (I have this part implemented already). When this Lambda finishes, I would like it to trigger again and again until there are no records left in the DDB with processed_status = false.
This can be done with recursive functions; there is a good tutorial here: https://labs.ebury.rocks/2016/11/22/recursive-amazon-lambda-functions/
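In rough outline (the table, index, and batch size of 2 are placeholders), the self-invocation via `InvocationType="Event"` is what turns the scheduled Lambda into a loop that stops once a query comes back empty:

```python
import json

def should_recurse(batch_count):
    """Keep re-invoking while the last query still returned records."""
    return batch_count > 0

def handler(event, context):
    """Process a small batch of unprocessed rows, then re-invoke this same
    function asynchronously; the chain stops once a query comes back empty."""
    import boto3
    from boto3.dynamodb.conditions import Key

    table = boto3.resource("dynamodb").Table("messages")  # hypothetical table
    resp = table.query(
        IndexName="processed_status-index",  # hypothetical GSI on processed_status
        KeyConditionExpression=Key("processed_status").eq("false"),
        Limit=2,  # small batch so the REST endpoint is not overloaded
    )
    for item in resp["Items"]:
        pass  # hit the REST endpoint, then set processed_status = "true"
    if should_recurse(resp["Count"]):
        boto3.client("lambda").invoke(
            FunctionName=context.function_name,
            InvocationType="Event",  # async: each batch runs in a fresh invocation
            Payload=json.dumps({}),
        )
```

Setting the function's reserved concurrency to 1, as the question suggests, guards against overlapping invocations if the schedule and the recursion ever coincide.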

What's the best AWS approach to send a notification message to validate whether all records have been processed in DynamoDB?

Introduction
We are building an application to process a monthly file, and there are several AWS components involved in this project:
Lambda reads the file from S3, parses it, and pushes it to DynamoDB with a PENDING flag on each record.
Another Lambda processes these records after the first Lambda is done, flagging each record as PROCESSED when it finishes with it.
Problem:
We want to send a result to SQS after all records are processed.
Our approach
Use DynamoDB Streams to trigger a Lambda each time a record gets updated; that Lambda queries DynamoDB to check if all records are processed, and sends the notification when that's true.
Questions
Is there any other approach that can achieve this goal without triggering a Lambda each time a record gets updated?
Is there a better approach that doesn't involve DynamoDB Streams?
I would recommend DynamoDB Streams, as they are reliable enough; triggering a Lambda for an update is pretty cheap, and an execution will usually take 1-100 ms. Even if you have millions of executions, it is a robust solution. One way is to have a shared counter of processed messages in ElastiCache: once you receive an update and the counter is 0, you are complete.
Is there any other approach that can achieve this goal without triggering a Lambda each time a record gets updated?
Another option is having a scheduled Lambda execution that checks the status of all records in the DB (query for PROCESSED) and moves the result to SQS. Depending on the load, you could define how often it runs (trigger it using a CloudWatch scheduled event).
What about having a table monthly_file_process with a row for every month and an extra counter attribute?
Once the S3 file is read, count the records and persist the total as the counter. With every PROCESSED record, decrease the counter; if the counter is 0 after the update, send the SQS notification. This whole thing, including sending to SQS, could be done from the second Lambda that processes the records - just an extra step of checking the counter.
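A sketch of that counter idea, assuming a hypothetical monthly_file_process table keyed by month with a numeric remaining attribute. DynamoDB's UpdateItem with ReturnValues="UPDATED_NEW" makes the decrement-and-check atomic, so exactly one writer sees the counter reach zero:

```python
def batch_complete(new_remaining):
    """Only the update that drives the counter to zero sends the notification."""
    return new_remaining == 0

def mark_processed(month):
    """Atomically decrement the remaining counter; return True if this
    record was the last one in the batch."""
    import boto3
    table = boto3.resource("dynamodb").Table("monthly_file_process")  # hypothetical
    resp = table.update_item(
        Key={"month": month},
        UpdateExpression="SET remaining = remaining - :one",
        ExpressionAttributeValues={":one": 1},
        ReturnValues="UPDATED_NEW",  # gives us the post-decrement value
    )
    return batch_complete(resp["Attributes"]["remaining"])

def record_handler(event, context):
    import boto3
    # ... process the record and flag it PROCESSED ...
    if mark_processed("2024-01"):  # hypothetical month key
        boto3.client("sqs").send_message(
            QueueUrl="https://sqs.example/monthly-done",  # hypothetical queue URL
            MessageBody="all records processed",
        )
```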

aws dynamodb stream lambda processes too quickly

I have a DynamoDB table that I send data into, and a stream on that table is processed by a Lambda that rolls up some stats and inserts them back into the table.
My issue is that my Lambda is processing the events too quickly, so almost every insert is being sent back to the DynamoDB table individually, and inserting them back into the table is causing throttling.
I need to slow my lambda down!
I have set my concurrency to 1
I had thought about just putting a sleep statement into the lambda code, but this will be billable time.
Can I delay the Lambda to only start once every x minutes?
You can't easily limit how often the Lambda runs, but you could re-architect things a little bit and use a scheduled CloudWatch Event as a trigger instead of your DynamoDB stream. Then you could have the Lambda execute every x minutes, collate the stats for records added since the last run, and push them to the table.
I never tried this myself, but I think you could do the following:
Put a delay queue between the stream and your Lambda.
That is, you would have a new Lambda function that just pushes events from the DDB stream to this SQS queue. You can set a delay of up to 15 minutes on the queue. Then set up your original Lambda to be triggered by the messages in this queue. Be wary of SQS limits, though.
As per the Lambda docs: "By default, Lambda invokes your function as soon as records are available in the stream. If the batch it reads from the stream only has one record in it, Lambda only sends one record to the function. To avoid invoking the function with a small number of records, you can tell the event source to buffer records for up to 5 minutes by configuring a batch window. Before invoking the function, Lambda continues to read records from the stream until it has gathered a full batch, or until the batch window expires."
Using this you can add a bit of a delay, and maybe process the batch sequentially even after receiving it. Also, since faster execution is not your priority, you will save cost as well: fewer Lambda invocations, and no money spent on sleep calls. From the AWS Lambda docs: "You are charged based on the number of requests for your functions and the duration, the time it takes for your code to execute."
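The batch window the docs describe is configured on the event source mapping itself; a boto3 sketch with a hypothetical mapping UUID (MaximumBatchingWindowInSeconds and BatchSize are the real parameter names):

```python
def clamp_window(seconds):
    """MaximumBatchingWindowInSeconds accepts values from 0 to 300 seconds."""
    return max(0, min(300, seconds))

def enable_batch_window(mapping_uuid, window_seconds=300, batch_size=1000):
    """Tell Lambda to buffer stream records before invoking the function,
    so many inserts arrive as one batch instead of one invocation each."""
    import boto3
    boto3.client("lambda").update_event_source_mapping(
        UUID=mapping_uuid,  # hypothetical id of the DDB stream trigger
        MaximumBatchingWindowInSeconds=clamp_window(window_seconds),
        BatchSize=batch_size,
    )
```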
No, unfortunately you cannot do that.
Having the concurrency set to 1 will definitely help, but won't solve the problem. What you could do instead is increase your provisioned capacity a little to prevent the throttling.
To circumvent the problem, though, #bwest's approach seems very good. I'd go with that.
Instead of adding a delay or setting concurrency to 1, you can do the following:
Increase the batch size, so that you process several events together. This introduces some delay and also costs less money.
Instead of putting the data back into DynamoDB, put it into another store where you are charged by the amount of memory/RAM you use rather than by WCUs.
Have a CloudWatch-triggered Lambda that takes the data from this temporary store and puts it back into DynamoDB.
This ensures a few things:
You can control the lag with respect to the staleness of the aggregated data (i.e. you can define two strategies, say 15 minutes or 1000 events, whichever comes first).
Your Lambda won't have to discard events when you are writing aggregated data very often (this problem would exist even if you used SQS).

How do you run functions in parallel?

My desire is to retrieve x number of records from a database based on some custom select statement, the output will be an array of json data. I then want to pass each element in the array into another lambda function in parallel.
So if 1000 records are returned, 1000 lambda functions need to be executed in parallel (I increase my account limit to what I need). If 30 out of 1000 fail, the main task that was retrieving the records needs to know about it.
I'm struggling to put together this simple flow.
I currently use javascript and AWS Aurora. I'm not looking for node.js/javascript code that retrieves the data, just the AWS Step Functions configuration and how to build an array within each function.
Thank you.
if 1000 records are returned, 1000 lambda functions need to be executed in parallel
What you are trying to achieve is not supported by Step Functions. A State Machine task cannot be modified based on the input it receives, so, for instance, a Parallel task cannot be configured to add or remove functions based on the number of items in an array input.
You should probably consider using an SQS Lambda trigger instead. The records retrieved from the DB can be added to an SQS queue, which will then trigger a Lambda function for each item received.
If 30 out of 1000 fail, the main task that was retrieving the records needs to know about it.
There are various ways to achieve this. SQS won't delete an item from the queue if the Lambda returns an error, and you can configure a DLQ and RedrivePolicy based on your requirements. Or you may want to come up with a custom solution that keeps a count of failing Lambdas and informs the service that fetches the records from the DB.
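One way to surface per-item failures is partial batch responses, sketched here under the assumption that the event source mapping has ReportBatchItemFailures enabled; `process` is a stand-in for the real per-record work:

```python
import json

def process(record):
    """Hypothetical per-record work, e.g. invoking the downstream function."""
    return record

def consumer_handler(event, context):
    """SQS-triggered worker. Returning the IDs of failed messages keeps
    successful messages out of the retry loop; messages that exhaust their
    retries land in the DLQ configured by the queue's RedrivePolicy."""
    failures = []
    for rec in event["Records"]:
        try:
            process(json.loads(rec["body"]))
        except Exception:
            failures.append({"itemIdentifier": rec["messageId"]})
    return {"batchItemFailures": failures}
```

The main task can then learn about the 30 failures by monitoring the DLQ depth (e.g. an alarm on ApproximateNumberOfMessagesVisible) rather than tracking each invocation itself.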