My lambda is triggered by a dynamodb table stream. Based on the doc: https://docs.aws.amazon.com/lambda/latest/dg/with-ddb.html, Lambda polls shards in your DynamoDB stream for records at a base rate of 4 times per second. When records are available, Lambda invokes your function and waits for the result. If processing succeeds, Lambda resumes polling until it receives more records.
This means I will see up to about 250 milliseconds of latency before my Lambda is triggered when an update happens in DynamoDB. Is there a way to improve this polling rate?
You cannot change the polling interval; you can only change things like the batch size or the parallelization factor.
Here you can look up the available configuration options when invoking a Lambda through DynamoDB Streams:
https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/sam-property-function-dynamodb.html
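For reference, a minimal SAM sketch of the knobs you *can* turn on a DynamoDB stream event source (function and table names are placeholders):

```yaml
# Hypothetical SAM snippet: batch size, batching window and
# parallelization are configurable; the poll rate is not.
MyStreamProcessor:
  Type: AWS::Serverless::Function
  Properties:
    Handler: app.handler
    Runtime: python3.9
    Events:
      TableStream:
        Type: DynamoDB
        Properties:
          Stream: !GetAtt MyTable.StreamArn
          StartingPosition: LATEST
          BatchSize: 1000                     # records handed to each invocation
          MaximumBatchingWindowInSeconds: 0   # 0 = invoke as soon as records arrive
          ParallelizationFactor: 10           # 1-10 concurrent batches per shard
```

Keeping the batching window at 0 and raising the parallelization factor is about as low-latency as the integration gets.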
Related
We have configured DynamoDB Streams to trigger a Lambda function. More than 10 million unique records will be inserted into the DynamoDB table within 30 minutes, and Lambda will process these records when triggered through streams.
As per the DynamoDB Streams documentation, stream records expire after 24 hours.
Question:
Does this mean that the Lambda function (multiple concurrent executions) should complete processing of all 10 million records within 24 hours?
If some stream events remain unprocessed after 24 hours, will they be lost?
As long as you don't throttle the Lambda, it won't fail to keep up.
What will happen is that the stream records will be batched according to your settings: if your DynamoDB stream trigger is set to 5 events at once, it will bundle five events and push them to the Lambda.
Even if that happens hundreds of times a minute, Lambda will (again, assuming you aren't deliberately limiting Lambda executions) spin up additional concurrent executions to handle the load.
This is standard AWS philosophy. Pretty much every serverless resource (and even some that aren't, like EC2 with Elastic Beanstalk) is designed to scale horizontally, seamlessly and effortlessly, to handle burst traffic.
Likely your Lambda executions will be done within a couple of minutes of the last event being sent. The '24 hour' limit applies to records waiting for a Lambda to be finished/reactivated (i.e., you can set up CloudWatch Events to 'hold' DynamoDB Streams processing until certain times of day, such as waiting until off hours to let all the stream records process, then turning processing off again during business hours the next day).
To give you a similar example: I ran 10,000 executions through SQS into a Lambda, and it completed all 10,000 in about 15 minutes. Lambda concurrency is designed to handle this kind of burst flow.
Your DynamoDB read/write capacity is going to be hammered, however, so make sure it is set to on-demand (or at least auto-scaling) rather than fixed provisioned capacity.
UPDATE
As @Maurice pointed out in the comments, there is a limit on the number of concurrent batches DynamoDB Streams will send at any moment. The calculation indicates that you will fall far short even with a short Lambda execution time; the longer the Lambda runs, the less likely you are to complete in time.
Which means, if you don't have to have all of those processed as quickly as possible, you should divide up the input.
You can add an AWS SQS queue somewhere in the process. Most likely you will need to, because even with the largest batch size and a very quick process, you won't get through all of them as fast as they are inserted into DynamoDB.
SQS retains messages for up to 14 days, which may be enough to do what you want. If you have control of the messages coming in, you can insert them into an SQS queue with a delay attached in order to process a smaller number of inserts at once: whatever can be accomplished in a single day, or slightly less. The flow would be:
Lambda to collate your inserts into an SQS queue -> SQS with a delay/smaller batch size -> Lambda to insert smaller batches into DynamoDB -> DynamoDB stream -> processing Lambda
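Maurice's point can be made concrete with some back-of-the-envelope arithmetic. Every number below is an assumption for illustration, not a measured value:

```python
# Rough estimate of how long a DynamoDB stream -> Lambda pipeline
# takes to drain, given the per-shard concurrency limit.
records = 10_000_000      # records inserted within the 30-minute window
shards = 10               # assumed number of active stream shards
batch_size = 1_000        # records per Lambda invocation
seconds_per_batch = 5     # assumed Lambda execution time per batch
parallelization = 1       # concurrent batches per shard

batches = records // batch_size          # invocations needed
concurrency = shards * parallelization   # invocations running at once
drain_seconds = batches // concurrency * seconds_per_batch
print(f"~{drain_seconds / 3600:.1f} hours to drain the stream")
```

Even with these generous assumptions the drain time far exceeds the 30-minute ingest window, and a longer per-batch runtime pushes it toward the 24-hour record expiry.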
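The "SQS with a delay/smaller batch size" step is essentially scheduling arithmetic. A minimal sketch (the function name and numbers are mine; note that per-message SQS delays cap at 900 seconds, so a spread this long would in practice need re-drives or a consumer-side throttle):

```python
def schedule_batches(n_messages, batch_size, seconds_between_batches):
    """Spread n_messages into throttled batches: returns a list of
    (batch_index, release_offset_seconds) pairs."""
    n_batches = -(-n_messages // batch_size)  # ceiling division
    return [(i, i * seconds_between_batches) for i in range(n_batches)]

# 10,000 messages, 500 per insert batch, one batch every 5 minutes:
plan = schedule_batches(10_000, 500, 300)
# the last of the 20 batches is released 19 * 300 s = 95 min after the first
```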
The other option is to do something similar but use a Step Functions state machine with wait states and maps. State machines have a one-year run-time limit, so you have plenty of time with that one.
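A hedged sketch of that state machine option in Amazon States Language (the Lambda ARN and the `$.chunks` input field are placeholders):

```json
{
  "StartAt": "ProcessChunks",
  "States": {
    "ProcessChunks": {
      "Type": "Map",
      "ItemsPath": "$.chunks",
      "MaxConcurrency": 1,
      "Iterator": {
        "StartAt": "Wait",
        "States": {
          "Wait": { "Type": "Wait", "Seconds": 60, "Next": "InsertChunk" },
          "InsertChunk": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:123456789012:function:InsertChunk",
            "End": true
          }
        }
      },
      "End": true
    }
  }
}
```

`MaxConcurrency: 1` serializes the chunks, so each iteration waits a minute and then inserts one chunk, pacing the stream.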
The final option is, instead of streaming the data straight into Lambda, to execute Lambdas that query smaller sections of the DynamoDB table at a time and process them.
I am using a Lambda function to read data from DynamoDB Streams. Lambda reads items from the stream and invokes my function once for each batch, synchronously, using an event source mapping.
From what I understand from the AWS docs, Lambda invokes a function for each batch in the stream. Suppose there are 1,000 items in the stream at once and I configure my Lambda function to read 100 items per batch.
So will it invoke the Lambda function 10 times concurrently to process 10 batches of 100 items each?
I am learning AWS. Is my understanding correct? If yes, what does 'synchronously invoked' mean?
DynamoDB streams use shards* to partition the data; which shard holds a record is determined by the table's hash key. DynamoDB Streams will trigger AWS Lambda for each shard that was updated and aggregate that shard's records into a batch, so the number of records in each batch depends on the number of records updated in each shard. Batch sizes can of course differ.
'Synchronously invoked' means that the service that triggered the function waits until the function ends before finishing its own execution. When you invoke asynchronously, you send a request and forget about it; whether the downstream function successfully processes the stream is not in the control of the upstream service. DynamoDB invokes the Lambda function synchronously and waits while it works. If it ends successfully, the batch is marked as processed. If it ends with a failure, it is retried a few more times. This is important to allow at-least-once processing of DynamoDB streams.
*Shards are different partitions of the data. They allow DynamoDB to process parallel queries and updates. Normally they reside in different storage nodes/Availability Zones.
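The batching described above can be sketched as follows (the modulo hash routing is my simplification; real shard assignment is managed by DynamoDB):

```python
from collections import defaultdict

def batch_stream_records(records, n_shards, batch_size):
    """Route each record to a shard by its hash key, then cut every
    shard's records (kept in arrival order) into Lambda-sized batches."""
    shards = defaultdict(list)
    for rec in records:
        shards[hash(rec["hash_key"]) % n_shards].append(rec)
    return [
        (shard_id, recs[i:i + batch_size])
        for shard_id, recs in shards.items()
        for i in range(0, len(recs), batch_size)
    ]

# 20 updates spread over 4 hash keys, 2 shards, batches of up to 3 records
updates = [{"hash_key": f"user{i % 4}", "seq": i} for i in range(20)]
batches = batch_stream_records(updates, n_shards=2, batch_size=3)
```

Each `(shard_id, batch)` pair stands for one synchronous Lambda invocation; within a shard the batches preserve arrival order.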
I'd like to execute a Lambda function with multiple pieces of data, only after a fixed amount of data has been gathered. The fixed amount would be, for example, a specific number of messages, or the messages sent within a specific time range.
I thought of solving this problem using an SQS queue, to which I write the messages, and polling to check the queue's status. But I don't like this solution, because I'd like to trigger the Lambda instantly when the criterion is met (for example: time elapsed since the first message was sent, or a fixed number of messages).
The ideal would be to send all the messages gathered, for example, 1 minute after the first message arrives.
To be clear:
First message arrives in the queue
From then on a timer starts (e.g. 1 min)
The timer ends and triggers the Lambda with all the messages gathered until then
Moreover, I'd like to handle different queues in parallel, based on different ids
Is there an elegant way to do so?
I already have in place a system of sequential Lambdas that handles the whole process per single message.
Unfortunately, it's not an easy task to do on AWS Lambda (we have a similar use case).
SQS or a Kinesis data stream as a trigger can be helpful, but both have several limitations:
SQS will be polled by AWS Lambda at a very high frequency. You will have to add a concurrency limit to your Lambda to make it get triggered with more than a single item, and the maximum batch size is just 10.
The base rate for Kinesis trigger is one per second for each shard, and cannot be changed.
Aggregating records across different invocations is not a good idea, because you never know whether the next invocation will start on a different container, in which case the aggregated records would be lost.
Kinesis Firehose can be helpful, as you can configure max batch size and max time range for sending a new batch. You can configure it to write to an S3 bucket and configure a lambda to be triggered by new created files.
Make sure that if you use a Kinesis data stream as the source of a Kinesis Firehose, the data from each shard of the data stream is separately batched in the Firehose (this is not documented by AWS).
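The Firehose batching knobs mentioned above look like this in a `CreateDeliveryStream` request (the stream name and ARNs are placeholders):

```json
{
  "DeliveryStreamName": "example-batching-stream",
  "ExtendedS3DestinationConfiguration": {
    "BucketARN": "arn:aws:s3:::example-batch-bucket",
    "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
    "BufferingHints": {
      "SizeInMBs": 5,
      "IntervalInSeconds": 60
    }
  }
}
```

Firehose flushes a batch to S3 as soon as either hint is reached, so `IntervalInSeconds: 60` approximates the "1 minute after the first message" behaviour from the question.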
You can do this in a few ways. I'd do it like this:
Have the queue be an event source for a lambda function
That Lambda function can either trigger a state machine or do nothing. It triggers the state machine only if there isn't one currently running (meaning we're within that 1-minute window).
The state machine has the following steps:
1 minute wait
Does its processing
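The gating logic in the second step can be sketched like this. The in-memory dict and all names are illustrative only; a real gate would check Step Functions `list_executions` (or a DynamoDB lock item), and the `start_execution` call is stubbed out:

```python
import time

WINDOW_SECONDS = 60
open_windows = {}  # queue_id -> timestamp when its 1-minute window opened

def gate(queue_id, now=None):
    """Start the aggregation state machine only if no window is open
    for this id; otherwise do nothing and let messages accumulate."""
    now = time.time() if now is None else now
    opened = open_windows.get(queue_id)
    if opened is not None and now - opened < WINDOW_SECONDS:
        return "window open, nothing to do"
    open_windows[queue_id] = now
    # a real implementation would call stepfunctions.start_execution(...)
    return "started aggregation window"
```

Keying the dict by `queue_id` also covers the question's requirement of handling different queues in parallel, based on different ids.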
I'm in the process of writing a Lambda function that processes items from a DynamoDB stream.
I thought part of the point behind Lambda was that if I have a large burst of events, it'll spin up enough instances to get through them concurrently, rather than feeding them sequentially through a single instance. As long as two events have a different key, I am fine with them being processed out of order.
However, I just read this page on Understanding Retry Behavior, which says:
For stream-based event sources (Amazon Kinesis Data Streams and DynamoDB streams), AWS Lambda polls your stream and invokes your Lambda function. Therefore, if a Lambda function fails, AWS Lambda attempts to process the erring batch of records until the time the data expires, which can be up to seven days for Amazon Kinesis Data Streams. The exception is treated as blocking, and AWS Lambda will not read any new records from the stream until the failed batch of records either expires or processed successfully. This ensures that AWS Lambda processes the stream events in order.
Does "AWS Lambda processes the stream events in order" mean Lambda cannot process multiple events concurrently? Is there any way to have it process events from distinct keys concurrently?
With AWS Lambda's support for a parallelization factor for Kinesis and DynamoDB event sources, order is still guaranteed for each partition key, but not necessarily within each shard when 'Concurrent batches per shard' is set greater than 1. Therefore the accepted answer needs to be revised.
Stream records are organized into groups, or shards.
According to Lambda documentation, the concurrency is achieved on shard-level. Within each shard, the stream events are processed in order.
Stream-based event sources: for Lambda functions that process Kinesis or DynamoDB streams, the number of shards is the unit of concurrency. If your stream has 100 active shards, there will be at most 100 Lambda function invocations running concurrently. This is because Lambda processes each shard's events in sequence.
And according to Limits in DynamoDB: do not allow more than two processes to read from the same DynamoDB Streams shard at the same time. Exceeding this limit can result in request throttling.
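The parallelization-factor behaviour can be illustrated with a small sketch: records are fanned out to concurrent batches by partition key, so shard-wide order is lost but per-key order survives. The modulo routing is my simplification of whatever Lambda does internally:

```python
from collections import defaultdict

def fan_out(shard_records, parallelization_factor):
    """Assign each record of one shard to a concurrent batch by its
    partition key, preserving arrival order inside every batch."""
    batches = defaultdict(list)
    for rec in shard_records:
        batches[hash(rec["pk"]) % parallelization_factor].append(rec)
    return batches

# 12 updates across 3 partition keys, fanned out to up to 4 concurrent batches
shard = [{"pk": f"key{i % 3}", "seq": i} for i in range(12)]
batches = fan_out(shard, parallelization_factor=4)
```

All updates for one key land in the same batch in arrival order, which is exactly the "ordered per partition key, concurrent across keys" guarantee.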
My Lambda function contains some code that includes queries to DynamoDB. Once a query is executed, the Lambda continues with the rest of the code, which is based on the result of that query. What happens if I exceed the capacity limit of DynamoDB? I could push the query to SQS and process it later, but then I would not be able to continue the execution of the Lambda. Another solution would be to retry each query that fails, but if DynamoDB is extremely busy, my Lambda might exceed the 5-minute limit. It seems like a lose-lose situation. What would you do?
The most fault-tolerant solution would be to decouple the querying and the processing of the results.
Instead of processing results immediately, write the results to another SQS queue and send an SNS notification.
Move the processing to a second Lambda function. This new function can be triggered by the SNS notification. It can read the results queue and process any pending messages.
Modify the original function to queue any failed queries for later.
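A sketch of the retry half of this design. The exception type and limits are placeholders (a real version would catch boto3's `ProvisionedThroughputExceededException` and size the budget to the function timeout):

```python
import time

def query_with_backoff(query_fn, max_attempts=5, base_delay=0.1):
    """Retry a throttled DynamoDB query with exponential backoff.
    query_fn stands in for the real boto3 call; if every attempt fails,
    the caller would queue the query (e.g. to SQS) for the second Lambda."""
    for attempt in range(max_attempts):
        try:
            return query_fn()
        except RuntimeError:  # stand-in for a throttling exception
            if attempt == max_attempts - 1:
                raise  # out of retry budget: hand off to the failed-query queue
            time.sleep(base_delay * 2 ** attempt)
```

Bounding the attempts is what keeps the function inside its timeout; everything that can't be retried in budget goes to the queue instead of blocking the Lambda.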