AWS DynamoDB stream Lambda processes too quickly

I have a DynamoDB table that I send data into. There is a stream on that table being processed by a Lambda function, which rolls up some stats and inserts them back into the same table.
My issue is that my Lambda is processing the events too quickly, so almost every insert immediately produces a write back to the DynamoDB table, and those writes are causing throttling.
I need to slow my lambda down!
I have set my concurrency to 1
I had thought about just putting a sleep statement into the lambda code, but this will be billable time.
Can I delay the Lambda to only start once every x minutes?

You can't easily limit how often the Lambda runs, but you could re-architect things a little bit and use a scheduled CloudWatch Event as a trigger instead of your DynamoDB stream. Then you could have the Lambda execute every x minutes, collate the stats for records added since the last run, and push them to the table.
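If you go this route, the schedule is just a CloudWatch Events/EventBridge rule with your Lambda as the target. A minimal boto3 sketch, with placeholder names, ARNs and a 5-minute rate:

    import boto3

    events = boto3.client("events")
    lambda_client = boto3.client("lambda")

    RULE_NAME = "rollup-stats-every-5-minutes"  # placeholder
    FUNCTION_ARN = "arn:aws:lambda:us-east-1:123456789012:function:rollup-stats"  # placeholder

    # Fire every 5 minutes; any rate()/cron() expression works here.
    rule = events.put_rule(Name=RULE_NAME, ScheduleExpression="rate(5 minutes)", State="ENABLED")

    # Point the rule at the Lambda...
    events.put_targets(Rule=RULE_NAME, Targets=[{"Id": "rollup", "Arn": FUNCTION_ARN}])

    # ...and allow CloudWatch Events to invoke the function.
    lambda_client.add_permission(
        FunctionName="rollup-stats",
        StatementId="allow-scheduled-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )

The function itself would then collect the records added since its last run (e.g. by timestamp) instead of reacting to stream events.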

I have never tried this myself, but I think you could do the following:
Put a delay queue between the stream and your Lambda.
That is, you would have a new Lambda function that just pushes events from the DDB stream onto this SQS queue. You can set a delay of up to 15 minutes on the queue. Then set up your original Lambda to be triggered by the messages in this queue. Be wary of SQS limits though.
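A minimal sketch of that forwarder function, assuming a standard SQS queue (the queue URL is a placeholder, and 900 seconds is the maximum DelaySeconds SQS accepts):

    import json
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/stats-delay-queue"  # placeholder

    def handler(event, context):
        # Forward every DynamoDB stream record into the delay queue.
        for record in event["Records"]:
            sqs.send_message(
                QueueUrl=QUEUE_URL,
                MessageBody=json.dumps(record["dynamodb"], default=str),
                DelaySeconds=900,  # 15 minutes, the maximum SQS allows
            )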

As per the Lambda docs: "By default, Lambda invokes your function as soon as records are available in the stream. If the batch it reads from the stream only has one record in it, Lambda only sends one record to the function. To avoid invoking the function with a small number of records, you can tell the event source to buffer records for up to 5 minutes by configuring a batch window. Before invoking the function, Lambda continues to read records from the stream until it has gathered a full batch, or until the batch window expires."
Using this you can add a bit of a delay, and perhaps process the batch sequentially after receiving it. Also, since faster execution is not your priority, you will save cost as well: fewer Lambda invocations, and no billable time wasted on sleep calls. From the AWS Lambda docs: "You are charged based on the number of requests for your functions and the duration, the time it takes for your code to execute."
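The batch window is a property of the event source mapping between the stream and the function. A hedged boto3 sketch, where the stream ARN and function name are placeholders:

    import boto3

    lambda_client = boto3.client("lambda")

    lambda_client.create_event_source_mapping(
        EventSourceArn="arn:aws:dynamodb:us-east-1:123456789012:table/stats/stream/2023-01-01T00:00:00.000",
        FunctionName="rollup-stats",
        StartingPosition="LATEST",
        BatchSize=1000,                      # gather more records per invocation
        MaximumBatchingWindowInSeconds=300,  # buffer records for up to 5 minutes
    )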

No, unfortunately you cannot do it directly.
Having the concurrency set to 1 will definitely help, but it won't solve the problem on its own. What you could do instead is slightly increase your provisioned write capacity (WCUs), since it is the writes back into the table that are being throttled.
To circumvent the problem though, @bwest's approach seems very good. I'd go with that.

Instead of adding a delay or setting concurrency to 1, you can do the following:
Increase the batch size, so that you process several events together. This introduces some delay and also costs less.
Instead of writing the data back to DynamoDB, put it in another store where you are charged for the amount of memory/RAM you use rather than by WCUs.
Have a CloudWatch-triggered Lambda that takes the data from this temporary store and writes it back to DynamoDB (a sketch of this follows below).
This will make sure of a few things:
You can control the lag, i.e. the staleness of the aggregated data (for example, you can define two strategies, say 15 minutes or 1,000 events, whichever comes first).
Your Lambda won't have to discard events when you are writing aggregated data very often (this problem exists even if you use SQS).
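A minimal sketch of that flusher Lambda, assuming the temporary store is an ElastiCache/Redis hash of pre-aggregated counters; the endpoint, key names and table schema are all made up for illustration:

    import boto3
    import redis  # assumes the redis-py client is packaged with the function

    ddb = boto3.resource("dynamodb")
    stats_table = ddb.Table("stats")  # placeholder table
    cache = redis.Redis(host="my-cache.abc123.cache.amazonaws.com", port=6379)  # placeholder endpoint

    def handler(event, context):
        # Pull the counters aggregated since the last scheduled run.
        counters = cache.hgetall("pending_stats")  # placeholder hash key
        if not counters:
            return

        # Write them back to DynamoDB in one batched pass instead of per event.
        with stats_table.batch_writer() as batch:
            for key, value in counters.items():
                batch.put_item(Item={"stat_id": key.decode(), "value": int(value)})

        # Clear the buffer so the next run starts fresh.
        cache.delete("pending_stats")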

Related

Automated Real Time Data Processing on AWS with Lambda

I am interested in doing automated real-time data processing on AWS using Lambda and I am not certain about how I can trigger my Lambda function. My data processing code involves taking multiple files and concatenating them into a single data frame after performing calculations on each file. Since files are uploaded simultaneously onto S3 and files are dependent on each other, I would like the Lambda to be only triggered when all files are uploaded.
Current Approaches/Attempts:
-I am considering an S3 trigger, but my concern is that it would result in an error when a single file upload triggers the Lambda before the other files are present. An alternative would be adding a wait time, but that is not preferred, to limit the computation resources used.
-A scheduled trigger using Cloudwatch/EventBridge, but this would not be real-time processing.
-An SNS trigger, but I am not certain the message can be automated without knowing when the file uploads are complete.
Any suggestion is appreciated! Thank you!
If you really cannot do it with a scheduled function, the best option is to trigger a Lambda function when an object is created.
The tricky bit is that it will fire your function on each object upload. So you can either identify the "last part", e.g. based on some metadata, or you will need to store and track the state of all uploads, e.g. in a DynamoDB table, and do the actual processing only when a batch is complete.
Best, Stefan
Your file coming in parts might be named like this:
filename_part1.ext
filename_part2.ext
If one of your systems is generating those files, then have that system also generate a final, blank dummy file named:
filename.final
Since an S3 event trigger can filter on a key suffix, use the .final extension to invoke the Lambda and process the records.
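A hedged sketch of that Lambda, assuming the S3 notification is already filtered on the .final suffix and the parts follow the naming convention above:

    import boto3
    from urllib.parse import unquote_plus

    s3 = boto3.client("s3")

    def handler(event, context):
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            # S3 event keys are URL-encoded, so decode before using them.
            marker_key = unquote_plus(record["s3"]["object"]["key"])  # e.g. "filename.final"
            base_name = marker_key.rsplit(".final", 1)[0]             # e.g. "filename"

            # List all the uploaded parts that share the marker's base name.
            parts = s3.list_objects_v2(Bucket=bucket, Prefix=base_name + "_part")
            keys = [obj["Key"] for obj in parts.get("Contents", [])]

            process_files(bucket, keys)  # your concatenation/calculation logic goes here

    def process_files(bucket, keys):
        ...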
As an alternative approach, if you do not have access to the server putting objects into your S3 bucket, then on each PUT operation in the bucket, invoke the Lambda and insert an entry into DynamoDB.
You need one unique entry per file (not per file part) in DynamoDB with:
filename and last_part_received_time
The last_part_received_time keeps getting updated as long as file parts keep arriving.
Now, this table can be looked up by a cron Lambda invocation, which checks whether the time skew (the difference between the SYSTIME of the Lambda invocation and the entry's last_part_received_time) is large enough to process the records.
I would still prefer to go with the first approach, as the second one still leaves a chance for error.
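A rough sketch of that cron check, assuming the tracking table is keyed on filename and stores last_part_received_time as an epoch timestamp (the table name, attributes and five-minute skew are all illustrative):

    import time
    import boto3

    ddb = boto3.resource("dynamodb")
    uploads = ddb.Table("file_upload_tracker")  # placeholder table, keyed on filename
    SKEW_SECONDS = 300                          # quiet period before a file is treated as complete

    def handler(event, context):
        now = int(time.time())
        # A scan is fine for a small tracking table; paginate or use a GSI for larger ones.
        for item in uploads.scan()["Items"]:
            if item.get("status") == "DONE":
                continue
            if now - int(item["last_part_received_time"]) > SKEW_SECONDS:
                process_file(item["filename"])  # your actual processing
                uploads.update_item(
                    Key={"filename": item["filename"]},
                    UpdateExpression="SET #s = :done",
                    ExpressionAttributeNames={"#s": "status"},  # "status" is a reserved word
                    ExpressionAttributeValues={":done": "DONE"},
                )

    def process_file(filename):
        ...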
Since you want this to be as real time as possible, perhaps you could just perform your logic every single time a file is uploaded, updating the version of the output as new files are added, and iterating through an S3 prefix per grouping of files, like in this other SO answer.
In terms of the architecture, you could add in an SQS queue or two to make this more resilient. An S3 Put Event can trigger an SQS message, which can trigger a Lambda function, and you can have error handling logic in the Lambda function that puts that event in a secondary queue with a visibility timeout (sort of like a backoff strategy) or back in the same queue for retries.
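For the error-handling part, a hedged sketch of the SQS-triggered function re-queueing a failed event onto a secondary retry queue with a delay (the queue URL and delay are placeholders):

    import json
    import boto3

    sqs = boto3.client("sqs")
    RETRY_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events-retry"  # placeholder

    def handler(event, context):
        for record in event["Records"]:
            body = json.loads(record["body"])  # the S3 event forwarded through SQS
            try:
                process_s3_event(body)         # your real processing logic
            except Exception:
                # Back off: push the event onto a secondary queue with a delay
                # instead of failing the whole batch.
                sqs.send_message(QueueUrl=RETRY_QUEUE_URL, MessageBody=record["body"], DelaySeconds=300)

    def process_s3_event(s3_event):
        ...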

What's the best AWS approach to send a notification message to validate whether all records have been processed in DynamoDB

Introduction
We are building an application to process a monthly file, and there are many AWS components involved in this project:
A Lambda reads the file from S3, parses it, and pushes it to DynamoDB with a PENDING flag on each record.
Another Lambda processes these records after the first Lambda is done, and flags each record as PROCESSED once it is done with it.
Problem:
We want to send a result to SQS after all records are processed.
Our approach
Is to use DynamoDB Streams to trigger a Lambda each time a record gets updated, have that Lambda query DynamoDB to check if all records are processed, and send the notification when that is true.
Questions
Is there any other approach that can achieve this goal without triggering a Lambda each time a record gets updated?
Is there a better approach that doesn't involve DynamoDB Streams?
I would recommend DynamoDB Streams, as they are reliable enough; triggering a Lambda for an update is pretty cheap, and an execution will usually take 1-100 ms. Even if you have millions of executions it is a robust solution. One option is to keep a shared counter of remaining messages in ElastiCache: once you receive an update and the counter is 0, you are complete.
Is there any other approach that can achieve this goal without triggering a Lambda each time a record gets updated?
The other option is a scheduled Lambda execution that checks the status of the records in the DB (query for PROCESSED) and pushes the result to SQS. Depending on the load, you could define how often it should run (trigger it using a CloudWatch scheduled event).
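A minimal sketch of that scheduled check, phrased as "is anything still PENDING"; the table name, status attribute and queue URL are assumptions, and a GSI on the status attribute would avoid the scan on a large table:

    import boto3
    from boto3.dynamodb.conditions import Attr

    ddb = boto3.resource("dynamodb")
    records = ddb.Table("monthly_records")  # placeholder table
    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/file-complete"  # placeholder

    def handler(event, context):
        # Count records that are still PENDING, paginating through the scan.
        pending = 0
        kwargs = {"FilterExpression": Attr("status").eq("PENDING"), "Select": "COUNT"}
        while True:
            page = records.scan(**kwargs)
            pending += page["Count"]
            if "LastEvaluatedKey" not in page:
                break
            kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]

        if pending == 0:
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody="all records processed")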
What about having a table monthly_file_process with a row for every month and an extra counter attribute?
Once the S3 file is read, count the records and persist the total as the counter. With every record marked PROCESSED, decrease the counter; if the counter is 0 after the update, send the SQS notification. This whole sending-to-SQS step could be done from the second Lambda, which processes the records, as just one extra step of checking the counter.
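A hedged sketch of that counter step, assuming the monthly_file_process table is keyed on month and holds a numeric remaining attribute (names and queue URL are placeholders). DynamoDB update expressions are atomic, so exactly one caller sees the counter hit zero:

    import boto3

    ddb = boto3.resource("dynamodb")
    progress = ddb.Table("monthly_file_process")  # placeholder table
    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/file-complete"  # placeholder

    def record_processed(month):
        # Atomically decrement the remaining-records counter for this month.
        result = progress.update_item(
            Key={"month": month},
            UpdateExpression="SET remaining = remaining - :one",
            ExpressionAttributeValues={":one": 1},
            ReturnValues="UPDATED_NEW",
        )
        if result["Attributes"]["remaining"] == 0:
            sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=f"month {month} fully processed")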

Scheduling a reminder email in AWS step function (through events/SES) based on Dynamo DB attributes

I have a Step Function with 3 Lambdas. The last Lambda basically writes an entry into DynamoDB with a timestamp, status = "unpaid" (this is updated to "paid" automatically for some entries based on another workflow) and an email address, and then closes the execution. Now I want to schedule reminders, sent via email, for any entry in DynamoDB which is unpaid and over 7 days old, a second reminder if an entry is unpaid for over 14 days, and a third and last reminder on the 19th day. So the question is:
Is there any way to do this scheduling per Step Function execution (so that it can monitor that particular entry in DynamoDB for 7, 14 and 19 days and send reminders accordingly while the status is still "unpaid")?
If yes, would it be too much overhead, since there could be millions of transactions?
The second way I was thinking of was to build another scheduler Lambda sequence: the first Lambda parses through the whole DynamoDB table searching for entries due for a reminder (either 7, 14 or 19 days), the second Lambda gets the list from the first Lambda and prepares the reminders based on whether each is the first, second or third (in a loop), and the third Lambda sends the reminders through SES.
Is there a better or easier way to do this?
I know we can trigger Step Functions or Lambdas through CloudWatch Events, and we also have crons we could use, but they did not quite suit the use case.
Any help here is appreciated!
DynamoDB does not have built-in functionality for a delayed notification based on logic; you would need to design this flow yourself. Luckily, AWS has all the tools you need to do this.
I believe the best option would probably be to create a CloudWatch Events/EventBridge schedule when the item is written to DynamoDB (either from your application or via a Lambda triggered by DynamoDB Streams).
This event would be scheduled for 7 days' time. When it fires, checks can be performed to validate whether the entry has been paid or not. If it has not been paid, you send out the notification and schedule the next event; if it has been paid, you simply exit the Lambda function. This then continues for the next two time periods.
You could further enhance this by using DynamoDB Streams, so that whenever the table is updated a Lambda is triggered to detect whether the status has changed from unpaid. If it has, simply remove the scheduled event so it never has to run.
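One way to create that per-item, one-off schedule is with EventBridge Scheduler rather than a classic cron rule; a hedged boto3 sketch where the schedule name, target function, role and payload shape are all assumptions:

    import json
    import boto3
    from datetime import datetime, timedelta, timezone

    scheduler = boto3.client("scheduler")

    def schedule_reminder(invoice_id, days_from_now=7):
        run_at = datetime.now(timezone.utc) + timedelta(days=days_from_now)
        scheduler.create_schedule(
            Name=f"reminder-{invoice_id}-{days_from_now}d",
            ScheduleExpression=f"at({run_at.strftime('%Y-%m-%dT%H:%M:%S')})",  # one-off schedule
            FlexibleTimeWindow={"Mode": "OFF"},
            Target={
                "Arn": "arn:aws:lambda:us-east-1:123456789012:function:send-reminder",  # placeholder
                "RoleArn": "arn:aws:iam::123456789012:role/scheduler-invoke-role",      # placeholder
                "Input": json.dumps({"invoice_id": invoice_id, "reminder_days": days_from_now}),
            },
        )

The reminder Lambda would check the item's status and, if it is still unpaid, send the email and call schedule_reminder again for the next period.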

Batching trigger Lambda on DynamoDB

I have a table like the one below:
Key | Value
---------------------------------------------------
Client_123_UNIQUE_ID | s3://abc.txt
Client_123_UNIQUE_ID | s3://xyz.txt
Client_456_UNIQUE_ID | s3://qaz.txt
Client_456_UNIQUE_ID | s3://qwe.txt
Client_789_UNIQUE_ID | s3://asd.txt
Client_789_UNIQUE_ID | s3://zxc.txt
Data is inserted into this table continuously from an AWS Lambda function (maybe millions of items).
I have a use case in which I need a trigger to perform some batch processing once 100 items are available in the table.
In other words, as soon as 100 new items have been created in this table, I would like a Lambda function to be triggered to perform batch processing on those 100 items.
From my research it seems DynamoDB Streams can support batching, but I'm not entirely clear based on the documentation:
Lambda reads records in batches and invokes your function to process records from the batch.
Lambda polls shards in your DynamoDB Streams stream for records at a base rate of 4 times per second. When records are available, Lambda invokes your function and waits for the result. If processing succeeds, Lambda resumes polling until it receives more records.
If your function returns an error, Lambda retries the batch until processing succeeds or the data expires. Until the issue is resolved, no data in the shard is processed. Handle any record processing errors in your code to avoid stalled shards and potential data loss.
Could you please help me clarify the documentation, or advise whether using DynamoDB Streams is the correct approach for this use case?
If my explanation is not clear enough, please leave comments so I can clarify further.
You can set the BatchSize when declaring the event source mapping between the stream and your Lambda. The max size is 10,000 items.
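For example, with boto3 you could adjust the BatchSize on the existing mapping (the function name is a placeholder). Note that BatchSize is an upper bound: Lambda may still invoke with fewer than 100 records if a full batch is not available when it polls, which matters for your "exactly 100 items" requirement:

    import boto3

    lambda_client = boto3.client("lambda")

    # Look up the existing stream -> Lambda mapping.
    mappings = lambda_client.list_event_source_mappings(FunctionName="batch-processor")
    uuid = mappings["EventSourceMappings"][0]["UUID"]

    # Ask for up to 100 records per invocation.
    lambda_client.update_event_source_mapping(UUID=uuid, BatchSize=100)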

AWS Kinesis Lambda Scheduled Pull Invocation

There seems to be no way to tell lambdas to pull records in a scheduled manner.
This means that my lambda function never gets invoked unless the size of records meets the batch specification.
I'd like my Lambda function to be invoked eagerly, so that it can also pull records after a specified amount of time elapses.
Imagine that you are building a real-time analytics service that does not fill the specified batch size for a long time during off-peak hours.
Is there any workaround to pull records periodically?
This means that my lambda function never gets invoked unless the size of records meets the batch specification.
That is not correct to my knowledge - can you provide the documentation that says so?
To my knowledge, AWS uses a daemon for polling the stream and checking for new records. The daemon is what triggers the Lambda, and it happens in one of two cases:
The batch size crossed the specified limit (the one configured for the Lambda).
A certain amount of time has passed (I don't know exactly how much) and the current batch is not empty.
I have made heavy use of Kinesis and Lambda, with the batch limit configured to 500 records (per invocation).
I have had invocations with fewer than 500 records, sometimes even ~20 records - this is a fact.
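If you want to stretch that time-based flush explicitly, the batch window on the event source mapping is the knob (the same setting mentioned in the DynamoDB answer earlier); a sketch with a placeholder mapping UUID:

    import boto3

    lambda_client = boto3.client("lambda")

    lambda_client.update_event_source_mapping(
        UUID="0f1e2d3c-your-mapping-uuid",   # placeholder
        BatchSize=500,
        MaximumBatchingWindowInSeconds=300,  # wait up to 5 minutes to fill a batch
    )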