AWS Kinesis, concurrent Lambda processing with guaranteed ordering

I have a Lambda with an event source pointed at a Kinesis stream consumer (with an arbitrary number of shards).
I would like to ensure that items in the stream with the same 'partition key' are processed by Lambda in sequence and not simultaneously. (The partition key is being used as the object's identity, and I don't want multiple Lambdas performing logic on the same object simultaneously.)
For example, if the items in the stream have partition keys:
1,2,1,3,4,1,2,1
If we take the order of processing to be left to right, Lambda would concurrently process one item for each of the partition keys 1, 2, 3, and 4. Then, when it has finished an item with a specific partition key, it can start processing another one with that key.
Is this achievable in some way, without the use of a distributed lock that would make inefficient use of Lambda?
Thanks

Items with the same 'partition key' will be processed by Lambda in sequence for a stream event source mapping.
Moreover, you can specify 'concurrent batches per shard' when creating the Lambda trigger:
If 'concurrent batches per shard' is 1 (the default), the order will be preserved for the whole shard.
If 'concurrent batches per shard' is between 2 and 10, the order will be preserved only for records with the same partition key within the shard.
You can read more about concurrent batches (ParallelizationFactor) at https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html
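As a rough illustration, here is a hedged boto3 sketch of creating such an event source mapping (the stream ARN and function name are placeholders):

```python
import boto3

lambda_client = boto3.client("lambda")

# Map a Kinesis stream to a Lambda function with up to 10 concurrent
# batches per shard. Records sharing a partition key still go to the
# same batch processor in order; only distinct keys are parallelized.
response = lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",  # placeholder
    FunctionName="my-processing-function",  # placeholder
    StartingPosition="LATEST",
    BatchSize=100,
    ParallelizationFactor=10,  # 'concurrent batches per shard', 1-10
)
print(response["UUID"])
```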

Seems like I was tackling the problem in the wrong way. Lambda guarantees that within a shard, the Lambda instance is invoked on one batch at a time. Therefore, there is no need for a distributed lock as at worst there will be multiple records belonging to the same entity in the same batch and processing them in order can be managed in-memory within the Lambda function itself.
Reference from the AWS FAQs http://aws.amazon.com/lambda/faqs/
Q: How does AWS Lambda process data from Amazon Kinesis streams and Amazon DynamoDB Streams?
The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel.
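A minimal sketch of that in-memory approach, assuming a Kinesis event source (process_record is a hypothetical per-record handler):

```python
from collections import defaultdict

def process_record(record):
    # Hypothetical per-record business logic.
    pass

def handler(event, context):
    # Lambda delivers one batch per shard at a time, so grouping by
    # partition key within the batch is enough to serialize per-entity work.
    by_key = defaultdict(list)
    for record in event["Records"]:
        by_key[record["kinesis"]["partitionKey"]].append(record)

    # Records in the batch arrive in shard order, so each per-key list
    # is already ordered; process each list sequentially.
    for key, records in by_key.items():
        for record in records:
            process_record(record)
```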

Related

Does AWS Lambda process DynamoDB stream events strictly in order?

I'm in the process of writing a Lambda function that processes items from a DynamoDB stream.
I thought part of the point behind Lambda was that if I have a large burst of events, it'll spin up enough instances to get through them concurrently, rather than feeding them sequentially through a single instance. As long as two events have a different key, I am fine with them being processed out of order.
However, I just read this page on Understanding Retry Behavior, which says:
For stream-based event sources (Amazon Kinesis Data Streams and DynamoDB streams), AWS Lambda polls your stream and invokes your Lambda function. Therefore, if a Lambda function fails, AWS Lambda attempts to process the erring batch of records until the time the data expires, which can be up to seven days for Amazon Kinesis Data Streams. The exception is treated as blocking, and AWS Lambda will not read any new records from the stream until the failed batch of records either expires or is processed successfully. This ensures that AWS Lambda processes the stream events in order.
Does "AWS Lambda processes the stream events in order" mean Lambda cannot process multiple events concurrently? Is there any way to have it process events from distinct keys concurrently?
With AWS Lambda's support for Parallelization Factor for Kinesis and DynamoDB event sources, the order is still guaranteed for each partition key, but not necessarily within each shard when 'Concurrent batches per shard' is set greater than 1. Therefore the accepted answer needs to be revised.
Stream records are organized into groups, or shards.
According to the Lambda documentation, concurrency is achieved at the shard level. Within each shard, the stream events are processed in order.
Stream-based event sources: for Lambda functions that process Kinesis or DynamoDB streams, the number of shards is the unit of concurrency. If your stream has 100 active shards, there will be at most 100 Lambda function invocations running concurrently. This is because Lambda processes each shard's events in sequence.
And according to Limits in DynamoDB,
Do not allow more than two processes to read from the same DynamoDB Streams shard at the same time. Exceeding this limit can result in request throttling.
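To see the concurrency ceiling for a given DynamoDB stream, you can count its open shards. A hedged boto3 sketch (the stream ARN is a placeholder, and describe_stream pagination is omitted):

```python
import boto3

streams = boto3.client("dynamodbstreams")

# Placeholder ARN; in practice read it from the table's LatestStreamArn.
stream_arn = "arn:aws:dynamodb:us-east-1:123456789012:table/events/stream/2020-01-01T00:00:00.000"

desc = streams.describe_stream(StreamArn=stream_arn)
# Shards without an EndingSequenceNumber are still open; each open shard
# is one unit of Lambda concurrency, and at most two readers may poll it.
open_shards = [
    s for s in desc["StreamDescription"]["Shards"]
    if "EndingSequenceNumber" not in s["SequenceNumberRange"]
]
print(f"{len(open_shards)} open shards -> up to {len(open_shards)} concurrent invocations")
```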

How do DynamoDB streams distribute records to shards?

My goal is to ensure that records published by a DynamoDB stream are processed in the "correct" order. My table contains events for customers. Hash key is Event ID, range key a timestamp. "Correct" order would mean that events for the same customer ID are processed in order. Different customer IDs can be processed in parallel.
I'm consuming the stream via Lambda functions. Consumers are spawned automatically per shard. So if the runtime decides to shard the stream, consumption happens in parallel (if I get this right) and I run the risk of processing a CustomerAddressChanged event before CustomerCreated (for example).
The docs imply that there is no way to influence the sharding. But they don't say so explicitly. Is there a way, e.g., by using a combination of customer ID and timestamp for the range key?
The assumption that sharding is determined by table keys seems to be correct. My solution will be to use customer ID as hash key and timestamp (or event ID) as range key.
This AWS blog says:
The relative ordering of a sequence of changes made to a single primary key will be preserved within a shard. Further, a given key will be present in at most one of a set of sibling shards that are active at a given point in time. As a result, your code can simply process the stream records within a shard in order to accurately track changes to an item.
This slide confirms it. I still wish the DynamoDB docs would explicitly say so...
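For illustration, a hedged boto3 sketch of that key design (table and attribute names are hypothetical):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical table matching the proposed key design: all events for a
# customer share one partition, so their stream records land in the same
# shard lineage and are emitted in write order.
dynamodb.create_table(
    TableName="customer-events",  # placeholder
    AttributeDefinitions=[
        {"AttributeName": "customerId", "AttributeType": "S"},
        {"AttributeName": "timestamp", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "customerId", "KeyType": "HASH"},
        {"AttributeName": "timestamp", "KeyType": "RANGE"},
    ],
    BillingMode="PAY_PER_REQUEST",
    StreamSpecification={"StreamEnabled": True, "StreamViewType": "NEW_AND_OLD_IMAGES"},
)
```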
I just had a response from AWS support. It seems to confirm @EagleBeak's assumptions about partitions being mapped to shards. Or, as I understand it, a partition is mapped to a shard tree.
My question was about REMOVE events due to TTL expiration, but it would apply to all other types of actions too.
Is a shard created per primary partition key? And then, if there are too many items in the same partition, does the shard get split into children?
A shard is created per partition in your DynamoDB table. If a partition split is required due to too many items in the same partition, the shard gets split into children as well. A shard might split in response to high levels of write activity on its parent table, so that applications can process records from multiple shards in parallel.
https://aws.amazon.com/blogs/database/dynamodb-streams-use-cases-and-design-patterns/
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
Will those removed 100 items be put in just one shard provided they all have the same partition key?
Assuming all 100 items have the same partition key value (but different sort key values), they would have been stored on the same partition. Therefore, they would be removed from the same partition and be put in the same shard.
Since "records sent to your AWS Lambda function are strictly serialized", how does this serialisation work in the case of TTL? Is
order within a shard established by partition/sort keys, TTL
expiration, etc.?
DynamoDB Streams captures a time-ordered sequence of item-level modifications in your DynamoDB table. This time-ordered sequence is preserved at a per-shard level. In other words, the order within a shard is established based on the order in which items were created, updated, or deleted.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
A DynamoDB stream consists of stream records which are grouped into shards. A shard can spawn child shards in response to a high number of writes on the DynamoDB table, so you can have parent shards and possibly multiple child shards. To ensure that your application processes the records in the right sequence, the parent shard must always be processed before the child shards. This is described in detail in the docs.
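To make the lineage rule concrete, here is a minimal sketch that orders shards so parents come before children, using each shard's ParentShardId (describe_stream pagination is omitted; this is just an illustration, not how the Kinesis Adapter is implemented):

```python
import boto3

streams = boto3.client("dynamodbstreams")

def shards_in_lineage_order(stream_arn):
    """Yield shard IDs so that every parent precedes its children."""
    shards = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]["Shards"]
    known_ids = {s["ShardId"] for s in shards}
    children, roots = {}, []
    for shard in shards:
        parent = shard.get("ParentShardId")
        if parent in known_ids:
            # Parent still present in the stream: it must be processed first.
            children.setdefault(parent, []).append(shard["ShardId"])
        else:
            # Parent expired out of the stream (or none): treat as a root.
            roots.append(shard["ShardId"])
    queue = list(roots)
    while queue:  # breadth-first walk keeps parents ahead of children
        shard_id = queue.pop(0)
        yield shard_id
        queue.extend(children.get(shard_id, []))
```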
Unfortunately, DynamoDB Streams records sent to AWS Lambda functions are strictly serialized per shard, and the ordering of records across different shards is not guaranteed.
From the AWS Lambda FAQs:
Q: How does AWS Lambda process data from Amazon Kinesis streams and Amazon DynamoDB Streams?
The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel.
If you use the DynamoDB Streams Kinesis Adapter, your application will process the shards and stream records in the correct order according to the DynamoDB documentation here. For more information on DynamoDB Streams Kinesis Adapter, see Using the DynamoDB Streams Kinesis Adapter to Process Stream Records.
So, using a DynamoDB Lambda trigger won't guarantee ordering across shards. Your other options include using the DynamoDB Streams Kinesis Adapter or the DynamoDB Streams low-level API, which is a lot more work.
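For completeness, a hedged sketch of the low-level route with boto3 (the stream ARN is a placeholder; error handling and multi-shard/lineage handling are omitted):

```python
import boto3

streams = boto3.client("dynamodbstreams")

# Placeholder ARN; fetch it from dynamodb.describe_table()["Table"]["LatestStreamArn"].
stream_arn = "arn:aws:dynamodb:us-east-1:123456789012:table/my-table/stream/2020-01-01T00:00:00.000"

# Read a single shard from its oldest available record.
shard_id = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = streams.get_shard_iterator(
    StreamArn=stream_arn,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    resp = streams.get_records(ShardIterator=iterator, Limit=100)
    for record in resp["Records"]:
        print(record["eventName"], record["dynamodb"].get("Keys"))
    iterator = resp.get("NextShardIterator")  # absent once the shard is closed and drained
```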

AWS DynamoDB trigger invocation speed and synchronization

Do triggers on DynamoDB tables have some sort of internal synchronization to keep everything in the order it is supposed to be?
Example: my trigger batch size is 1 and it's configured to always start reading from the latest entry. Two entries are made to the DB one millisecond apart (or at the same time). I don't know the time it takes for the trigger and Lambda function to be invoked, but let's say for argument's sake it's longer than the time between DB entries (>1 ms). Can I be sure that both Lambda invocations don't receive the data from the second DB entry?
DynamoDB Streams doesn't send duplicates.
No, DynamoDB Streams is designed so that every update made to your table will be represented exactly once in the stream.
DynamoDB Streams guarantees the following:
Each stream record appears exactly once in the stream. For each item that is modified in a DynamoDB table, the stream records appear in the same sequence as the actual modifications to the item.
DynamoDB Streams provides a time-ordered sequence of item-level changes made to data in a table.
As there is a difference of a few milliseconds between update 1 and update 2, Lambda should get the two stream records in time-ordered sequence (i.e. update 1 and then update 2).
Processing stream records on Lambda:
The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel.
Stream-based event sources: if you create a Lambda function that processes events from stream-based services (Amazon Kinesis Streams or DynamoDB streams), the number of shards per stream is the unit of concurrency. If your stream has 100 active shards, there will be 100 Lambda functions running concurrently. Then, each Lambda function processes events on a shard in the order that they arrive.
Short answer:
The stream ensures that there are no duplicates, so there is no way that two Lambda invocations receive the same data.
As for the processing of stream records (i.e. whether processing of the second update starts after the first finishes), that depends on the shards per stream, the unit of concurrency.
Because shards have a lineage (parent and children), applications must always process a parent shard before they process a child shard. This ensures that the stream records are also processed in the correct order. Use the DynamoDB Streams Kinesis Adapter if you want to preserve the correct processing order.

How does AWS Lambda parallel execution works with DynamoDB?

I went through this article, which says that the data records are organized into groups called shards, and these shards can be consumed and processed in parallel by a Lambda function.
I also found these slides from an AWS webinar, where on slide 22 you can also see that Lambda functions consume different shards in parallel.
However, I could not achieve parallel execution of a single function. I created a simple Lambda function that runs for a minute. Then I started to create tons of items in DynamoDB, expecting to get a lot of stream records. In spite of this, my function's invocations started one after another.
What am I doing wrong?
Pre-context:
How does DynamoDB store data?
DynamoDB uses partitions to store the table records. These partitions are abstracted from users and managed by the DynamoDB team. As data grows in the table, these partitions are further split internally.
What are these DynamoDB streams all about?
DynamoDB, as a database, provides a way for users to retrieve the ordered change logs (think of them as the transactional replay logs of a traditional database). These are vended as DynamoDB table streams.
How is data published to streams?
A stream has a concept of shards (which is somewhat similar to partitions). Shards by definition contain ordered events. In DynamoDB terminology, a stream shard contains the data from a certain partition.
Cool! So what happens if data grows in the table or frequent writes occur?
DynamoDB keeps persisting records, based on their HashKey/SortKey, in the associated partition until a threshold is breached (such as table size and/or RCU/WCU counts). The exact values of these thresholds are not shared by DynamoDB, though we have some documentation with rough estimates.
As this threshold is breached, DynamoDB splits the partition and re-hashes to distribute the data (somewhat) evenly across the partitions.
Since new partitions have arrived, their data will be published to their own shards (each mapped to its partition).
Great, so what about Lambda? How does the parallel processing work then?
One Lambda function processes records from one and only one shard. Thus the number of shards present in the DynamoDB stream decides the number of Lambda functions running in parallel.
Roughly, you can think of it as: # of partitions = # of shards = # of Lambdas running in parallel.
The first article says:
Because shards have a lineage (parent and children), applications must always process a parent shard before it processes a child shard. This will ensure that the stream records are also processed in the correct order.
Yet, when working with Kinesis streams for example, you can achieve parallelism by having multiple shards, as the order in which records are processed is guaranteed only within a shard.
Side note: it makes some sense to trigger Lambda with DynamoDB events in order.

AWS Lambda Limits when processing Kinesis Stream

Can someone explain what happens to events when a Lambda is subscribed to Kinesis item create events? There is a limit of 100 concurrent requests for an account in AWS, so if 1,000,000 items are added to Kinesis, how are the events handled? Are they queued up for the next available concurrent Lambda?
From the FAQ http://aws.amazon.com/lambda/faqs/
"Q: How does AWS Lambda process data from Amazon Kinesis streams and Amazon DynamoDB Streams?
The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel."
What this means is that if you have 1M items added to Kinesis but only one shard, the throttle doesn't matter: you will only have one Lambda function instance reading off that shard in serial, based on the batch size you specified. The more shards you have, the more concurrent invocations your function will see. If you have a stream with > 100 shards, the account limit you mention can be easily increased to whatever you need it to be through AWS customer support. More details here: http://docs.aws.amazon.com/lambda/latest/dg/limits.html
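For example, a hedged sketch of writing to the stream so that per-entity ordering holds (the stream name and payloads are placeholders):

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# Records for the same entity share a partition key, so they hash to the
# same shard and Lambda receives them strictly in order; customer-2 may be
# processed in parallel on another shard.
kinesis.put_records(
    StreamName="my-stream",  # placeholder
    Records=[
        {"PartitionKey": "customer-1", "Data": json.dumps({"seq": 1}).encode()},
        {"PartitionKey": "customer-1", "Data": json.dumps({"seq": 2}).encode()},
        {"PartitionKey": "customer-2", "Data": json.dumps({"seq": 1}).encode()},
    ],
)
```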
Hope that helps!