I'm in the process of writing a Lambda function that processes items from a DynamoDB stream.
I thought part of the point behind Lambda was that if I have a large burst of events, it'll spin up enough instances to get through them concurrently, rather than feeding them sequentially through a single instance. As long as two events have a different key, I am fine with them being processed out of order.
However, I just read this page on Understanding Retry Behavior, which says:
For stream-based event sources (Amazon Kinesis Data Streams and DynamoDB streams), AWS Lambda polls your stream and invokes your Lambda function. Therefore, if a Lambda function fails, AWS Lambda attempts to process the erring batch of records until the time the data expires, which can be up to seven days for Amazon Kinesis Data Streams. The exception is treated as blocking, and AWS Lambda will not read any new records from the stream until the failed batch of records either expires or processed successfully. This ensures that AWS Lambda processes the stream events in order.
Does "AWS Lambda processes the stream events in order" mean Lambda cannot process multiple events concurrently? Is there any way to have it process events from distinct keys concurrently?
Since the announcement "AWS Lambda Supports Parallelization Factor for Kinesis and DynamoDB Event Sources", order is still guaranteed for each partition key, but not necessarily within each shard, when "Concurrent batches per shard" is set to a value greater than 1. The accepted answer therefore needs to be revised.
Stream records are organized into groups, or shards.
According to Lambda documentation, the concurrency is achieved on shard-level. Within each shard, the stream events are processed in order.
Stream-based event sources : for Lambda functions that process Kinesis
or DynamoDB streams the number of shards is the unit of concurrency.
If your stream has 100 active shards, there will be at most 100 Lambda
function invocations running concurrently. This is because Lambda
processes each shard’s events in sequence.
And according to Limits in DynamoDB,
Do not allow more than two processes to read from the same DynamoDB
Streams shard at the same time. Exceeding this limit can result in
request throttling.
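The shard-level limit quoted above comes down to simple arithmetic. Here is a minimal sketch, assuming the upper bound is the number of active shards multiplied by the "concurrent batches per shard" setting (1 by default, at most 10). The function name is hypothetical, not part of any AWS SDK.

```python
def max_concurrent_invocations(active_shards: int, concurrent_batches_per_shard: int = 1) -> int:
    """Upper bound on concurrent invocations for one stream event source.

    Hypothetical helper for illustration only. Assumes the bound is
    shards * concurrent batches per shard.
    """
    if active_shards < 0:
        raise ValueError("active_shards must not be negative")
    if not 1 <= concurrent_batches_per_shard <= 10:
        raise ValueError("concurrent_batches_per_shard must be between 1 and 10")
    return active_shards * concurrent_batches_per_shard


if __name__ == "__main__":
    # 100 active shards with the default setting.
    print(max_concurrent_invocations(100))
    # A single shard with 10 concurrent batches.
    print(max_concurrent_invocations(1, 10))
```

This is only a ceiling: the actual number of running invocations also depends on how much data is waiting in each shard.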
My Kinesis stream has only 1 shard, and when creating the Lambda trigger I set concurrent batches per shard for the Kinesis stream source to 10. When there is a spike in stream data, concurrency will increase to 10, which means we will have 10 Lambda invocations working in parallel. My question is: in this case, how can we guarantee that the event stream is processed serially? It seems impossible to me, because we can't control the concurrency. Does anyone have an idea? I can't get my head around it.
AWS Lambda can process multiple batches per shard concurrently and still process events serially, as long as all events in the Kinesis stream have the same partition key.
From AWS documentation:
You can also increase concurrency by processing multiple batches from each shard in parallel. Lambda can process up to 10 batches in each shard simultaneously. If you increase the number of concurrent batches per shard, Lambda still ensures in-order processing at the partition-key level.
References:
Using AWS Lambda with Amazon Kinesis (AWS)
Partition Key (Amazon Kinesis Data Streams Terminology and Concepts)
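To illustrate the partition-key guarantee quoted above, here is a toy model. It is not Lambda's actual implementation, which the quoted documentation does not describe. It assumes records from one shard are split into a fixed number of lanes by hashing the partition key, and that each lane is consumed in order. The names `lane_for` and `split_into_lanes` are made up for this sketch.

```python
import hashlib
from collections import defaultdict


def lane_for(partition_key: str, lanes: int) -> int:
    """Map a partition key to a lane with a stable hash (toy model only)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % lanes


def split_into_lanes(records, lanes):
    """Split (partition_key, payload) pairs, given in shard order, into lanes.

    Records with the same key always land in the same lane, and each lane
    keeps the original shard order.
    """
    result = defaultdict(list)
    for key, payload in records:
        result[lane_for(key, lanes)].append((key, payload))
    return dict(result)


if __name__ == "__main__":
    shard = [(key, seq) for seq, key in enumerate(["a", "b", "a", "c", "a", "b"])]
    for lane, items in sorted(split_into_lanes(shard, 3).items()):
        print(lane, items)
```

In this model different keys may be processed at the same time in different lanes, while two records with the same key can never overtake each other.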
I have a Lambda with an Event source pointed to a Kinesis Stream Consumer (with an arbitrary number of shards)
I would like to ensure that items in the stream with the same 'partition key' are processed by Lambda in sequence and not simultaneously. ( This is being used as the object's identity, and I don't want multiple Lambdas performing logic on the same object simultaneously.)
For example, if the items in the stream have partition keys:
1,2,1,3,4,1,2,1
If we take the order of processing to be left to right, Lambda would process an item with each of the partition keys 1,2, 3 and 4 concurrently. Then, when it has finished an item with a specific partition key it can start processing another one with that key.
Is this achievable in some way, without the use of a distributed lock that would make inefficient use of Lambda?
Thanks
Items with the same 'partition key' will be processed by Lambda in sequence for stream event source mapping.
Moreover, you can specify 'concurrent batches per shard' when creating Lambda trigger:
If 'concurrent batches per shard' is 1 (the default), then the order will be preserved for the whole shard.
If 'concurrent batches per shard' is between 2 and 10, then the order will be preserved only for records with the same partition key within the shard.
You can read about concurrent batches (ParallelizationFactor) at https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html
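For anyone setting this up in code rather than in the console, here is a rough sketch using boto3's `create_event_source_mapping`. The stream ARN and function name are placeholders, `build_mapping_params` and `create_mapping` are hypothetical helper names, and the boto3 call itself needs the SDK installed and valid AWS credentials, so it is kept in a separate function.

```python
def build_mapping_params(stream_arn, function_name, batch_size=100,
                         concurrent_batches_per_shard=1):
    """Build keyword arguments for boto3's create_event_source_mapping."""
    if not 1 <= concurrent_batches_per_shard <= 10:
        raise ValueError("concurrent batches per shard must be between 1 and 10")
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",
        "BatchSize": batch_size,
        # 1 keeps order for the whole shard; 2..10 keeps order per partition key.
        "ParallelizationFactor": concurrent_batches_per_shard,
    }


def create_mapping(params):
    """Create the trigger. Requires boto3 and AWS credentials (not run here)."""
    import boto3  # third-party AWS SDK, imported lazily on purpose

    return boto3.client("lambda").create_event_source_mapping(**params)


if __name__ == "__main__":
    # Placeholder ARN and function name, for illustration only.
    params = build_mapping_params(
        "arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
        "example-function",
        concurrent_batches_per_shard=10,
    )
    print(params)
```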
Seems like I was tackling the problem in the wrong way. Lambda guarantees that within a shard, the Lambda instance is invoked on one batch at a time. Therefore, there is no need for a distributed lock as at worst there will be multiple records belonging to the same entity in the same batch and processing them in order can be managed in-memory within the Lambda function itself.
Reference from the AWS FAQs http://aws.amazon.com/lambda/faqs/
Q: How does AWS Lambda process data from Amazon Kinesis streams and
Amazon DynamoDB Streams?
The Amazon Kinesis and DynamoDB Streams records sent to your AWS
Lambda function are strictly serialized, per shard. This means that if
you put two records in the same shard, Lambda guarantees that your
Lambda function will be successfully invoked with the first record
before it is invoked with the second record. If the invocation for one
record times out, is throttled, or encounters any other error, Lambda
will retry until it succeeds (or the record reaches its 24-hour
expiration) before moving on to the next record. The ordering of
records across different shards is not guaranteed, and processing of
each shard happens in parallel.
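The in-memory approach described in this answer could look roughly like the sketch below. It assumes the standard Kinesis event shape (`Records[].kinesis.partitionKey` and base64-encoded `Records[].kinesis.data`). `process_entity` is a hypothetical stand-in for the real business logic.

```python
import base64


def process_entity(partition_key, payloads):
    """Hypothetical business logic: handle one entity's records in order."""
    return (partition_key, len(payloads))


def handler(event, context):
    """Group the batch by partition key, keeping arrival order within each key."""
    grouped = {}  # dicts keep insertion order in Python 3.7+
    for record in event["Records"]:
        kinesis = record["kinesis"]
        payload = base64.b64decode(kinesis["data"])
        grouped.setdefault(kinesis["partitionKey"], []).append(payload)
    # Returned only so the sketch is easy to inspect.
    return [process_entity(key, payloads) for key, payloads in grouped.items()]


if __name__ == "__main__":
    def make_record(key, text):
        data = base64.b64encode(text.encode("utf-8")).decode("ascii")
        return {"kinesis": {"partitionKey": key, "data": data}}

    sample = {"Records": [make_record("1", "a"), make_record("2", "b"), make_record("1", "c")]}
    print(handler(sample, None))
```

Because all records for an entity that arrive in the same batch are handled inside a single invocation, no lock is needed within the batch.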
I am using a Lambda function to read data from DynamoDB streams. Lambda reads items from the stream and invokes the function once for each batch. Lambda invokes the function synchronously using an event source mapping.
From what I understand from the AWS docs, Lambda invokes the function for each batch in the stream. Suppose 1000 items arrive in the stream at once and I configure my Lambda function to read 100 items per batch.
Will it then invoke 10 Lambda functions concurrently to process 10 batches of 100 items each?
I am learning AWS. Is my understanding correct? If yes, what does synchronously invoked mean?
DynamoDB uses shards* to partition the data inside a table. The data that will be stored in each shard is defined by the HashKey of the table. DynamoDB streams will trigger AWS Lambda for each shard that was updated and aggregate the shard records in a batch. So the number of records in each batch will depend on the number of records updated in each shard. They can be different of course.
Synchronously invoked means that the caller waits until the function finishes before continuing its own work. When you invoke asynchronously, you send a request and forget about it; whether the downstream function processes the event successfully is outside the caller's control. For DynamoDB streams, Lambda's event source mapping polls the stream and invokes the function synchronously, waiting while it works. If the function ends successfully, the batch is marked as processed. If it ends with a failure, the batch is retried. This is important to allow at-least-once processing of DynamoDB streams.
*Shards are different partitions of the database. They allow DynamoDB to process parallel queries and updates. Normally they reside in different storages/availability zones.
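The synchronous, per-shard behaviour described above can be simulated in a few lines. This is a toy model, not how the Lambda service is implemented: `poll_shard` is a made-up name, batches are assumed to be full whenever enough records are available, and `max_attempts` exists only so the simulation terminates.

```python
def poll_shard(records, batch_size, handler, max_attempts=3):
    """Deliver one shard's records in order, one batch at a time.

    Returns the number of handler invocations, retries included.
    """
    invocations = 0
    position = 0
    while position < len(records):
        batch = records[position:position + batch_size]
        for attempt in range(1, max_attempts + 1):
            invocations += 1
            try:
                handler(batch)  # synchronous: wait for the result before moving on
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # give up in the simulation
        position += len(batch)  # advance only after the batch succeeded
    return invocations


if __name__ == "__main__":
    seen = []
    count = poll_shard(list(range(1000)), 100, seen.append)
    print(count, len(seen))
```

In this model, 1000 records in a single shard with a batch size of 100 produce ten invocations one after another, not ten concurrent ones. Concurrency only appears when the records are spread over several shards.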
Do triggers on DynamoDB tables have some sort of internal synchronization to keep everything in the order it is supposed to be?
Example: my trigger batch size is 1 and it's configured to always start reading from the latest entry. Two entries are made to the DB at one millisecond apart (or at the same time). I don't know the time it takes for the trigger and lambda function to be invoked but let's say for argument's sake it's longer than the time between DB entries (>1ms). Can I be sure that both lambda invocations don't receive the data from the second DB entry?
DynamoDB Streams doesn't send duplicates.
No, DynamoDB Streams is designed so that every update made to your
table will be represented exactly once in the stream.
DynamoDB Streams guarantees the following:
Each stream record appears exactly once in the stream. For each item
that is modified in a DynamoDB table, the stream records appear in the
same sequence as the actual modifications to the item.
DynamoDB Streams provides a time-ordered sequence of item-level changes made to data in a table.
As there is a difference of a few milliseconds between updates 1 and 2, Lambda should receive the two stream records in time-ordered sequence (i.e. update 1 and then update 2).
Processing Stream Records in Lambda:
The Amazon Kinesis and DynamoDB Streams records sent to your AWS
Lambda function are strictly serialized, per shard. This means that if
you put two records in the same shard, Lambda guarantees that your
Lambda function will be successfully invoked with the first record
before it is invoked with the second record. If the invocation for one
record times out, is throttled, or encounters any other error, Lambda
will retry until it succeeds (or the record reaches its 24-hour
expiration) before moving on to the next record. The ordering of
records across different shards is not guaranteed, and processing of
each shard happens in parallel.
Stream-based event sources –
If you create a Lambda function that processes events from
stream-based services (Amazon Kinesis Streams or DynamoDB streams),
the number of shards per stream is the unit of concurrency. If your
stream has 100 active shards, there will be 100 Lambda functions
running concurrently. Then, each Lambda function processes events on a
shard in the order that they arrive.
Short answer:
The stream ensures that there are no duplicates, so there is no way that two Lambda invocations receive the same data.
Regarding the processing of stream records: whether processing of the second update starts only after processing of the first has finished depends on the number of shards per stream (the unit of concurrency).
Because shards have a lineage (parent and children), an application must always process a parent shard before it processes a child shard. This ensures that the stream records are also processed in the correct order. Use the DynamoDB Streams Kinesis Adapter if you want to preserve the correct processing order.
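If you read the stream yourself instead of using the adapter, the parent-before-child rule amounts to ordering shards by lineage. Here is a small sketch, assuming shard descriptions shaped like the `Shards` entries returned by DescribeStream (`ShardId` plus an optional `ParentShardId`). `order_shards` is a hypothetical helper name.

```python
def order_shards(shards):
    """Return shard IDs so that every parent comes before its children.

    A parent that is no longer listed (for example because it has already
    expired) is simply skipped.
    """
    by_id = {shard["ShardId"]: shard for shard in shards}
    ordered = []
    visited = set()
    for shard in shards:
        # Walk up the lineage, collecting ancestors not yet emitted.
        chain = []
        current = shard["ShardId"]
        while current in by_id and current not in visited:
            visited.add(current)
            chain.append(current)
            current = by_id[current].get("ParentShardId")
        # Emit the oldest ancestor first.
        ordered.extend(reversed(chain))
    return ordered


if __name__ == "__main__":
    sample = [
        {"ShardId": "shard-c", "ParentShardId": "shard-b"},
        {"ShardId": "shard-a"},
        {"ShardId": "shard-b", "ParentShardId": "shard-a"},
    ]
    print(order_shards(sample))
```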
Can someone explain what happens to events when a Lambda function is subscribed to Kinesis item-create events? There is a limit of 100 concurrent requests for an account in AWS, so if 1,000,000 items are added to Kinesis, how are the events handled? Are they queued up for the next available concurrent Lambda?
From the FAQ http://aws.amazon.com/lambda/faqs/
"Q: How does AWS Lambda process data from Amazon Kinesis streams and Amazon DynamoDB Streams?
The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel."
What this means is if you have 1M items added to Kinesis, but only one shard, the throttle doesn't matter - you will only have one Lambda function instance reading off that shard in serial, based on the batch size you specified. The more shards you have, the more concurrent invocations your function will see. If you have a stream with > 100 shards, the account limit you mention can be easily increased to whatever you need it to be through AWS customer support. More details here. http://docs.aws.amazon.com/lambda/latest/dg/limits.html
hope that helps!