My goal is to ensure that records published by a DynamoDB stream are processed in the "correct" order. My table contains events for customers. Hash key is Event ID, range key a timestamp. "Correct" order would mean that events for the same customer ID are processed in order. Different customer IDs can be processed in parallel.
I'm consuming the stream via Lambda functions. Consumers are spawned automatically per shard. So if the runtime decides to shard the stream, consumption happens in parallel (if I get this right) and I run the risk of processing a CustomerAddressChanged event before CustomerCreated (for example).
The docs imply that there is no way to influence the sharding. But they don't say so explicitly. Is there a way, e.g., by using a combination of customer ID and timestamp for the range key?
The assumption that sharding is determined by table keys seems to be correct. My solution will be to use customer ID as hash key and timestamp (or event ID) as range key.
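A minimal sketch of that key schema, assuming the attribute names customer_id and event_ts and the table name customer-events (all three are made up for illustration, they are not from the original post). It only builds the arguments that boto3's create_table accepts, so it runs without AWS credentials:

```python
def customer_events_table_spec(table_name="customer-events"):
    """Build create_table arguments for a table keyed by customer.

    Attribute and table names are hypothetical placeholders.
    """
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": "customer_id", "AttributeType": "S"},
            {"AttributeName": "event_ts", "AttributeType": "S"},
        ],
        "KeySchema": [
            # Hash (partition) key: all events of one customer share it.
            {"AttributeName": "customer_id", "KeyType": "HASH"},
            # Range (sort) key: orders the events of that customer.
            {"AttributeName": "event_ts", "KeyType": "RANGE"},
        ],
        "BillingMode": "PAY_PER_REQUEST",
        "StreamSpecification": {
            "StreamEnabled": True,
            "StreamViewType": "NEW_AND_OLD_IMAGES",
        },
    }
```

With boto3 installed and credentials configured, this could be passed on as boto3.client("dynamodb").create_table(**customer_events_table_spec()).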
This AWS blog says:
The relative ordering of a sequence of changes made to a single
primary key will be preserved within a shard. Further, a given key
will be present in at most one of a set of sibling shards that are
active at a given point in time. As a result, your code can simply
process the stream records within a shard in order to accurately track
changes to an item.
This slide confirms it. I still wish the DynamoDB docs would explicitly say so...
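To illustrate "process the stream records within a shard in order", here is a rough sketch of a Lambda handler for a DynamoDB stream trigger. The hash key name customer_id is an assumption carried over from my solution above, and the handler only collects what it would apply instead of doing real work:

```python
def handler(event, context=None):
    """Walk a DynamoDB stream batch in the order Lambda delivered it."""
    applied = []
    # A batch comes from a single shard, and its records are already in
    # shard order, so a plain loop preserves the per-customer order.
    for record in event["Records"]:
        change = record["dynamodb"]
        customer_id = change["Keys"]["customer_id"]["S"]  # assumed key name
        applied.append(
            (customer_id, record["eventName"], change["SequenceNumber"])
        )
    return applied
```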
I just had a response from AWS support. It seems to confirm @EagleBeak's assumption about partitions being mapped to shards. Or, as I understand it, a partition is mapped to a shard tree.
My question was about REMOVE events due to TTL expiration, but it would apply to all other types of actions too.
Is a shard created per Primary Partition Key? and then if there are too many items in the same partition, the shard gets split into children?
A shard is created per partition in your DynamoDB table. If a
partition split is required due to too many items in the same
partition, the shard gets split into children as well. A shard might
split in response to high levels of write activity on its parent
table, so that applications can process records from multiple shards
in parallel.
https://aws.amazon.com/blogs/database/dynamodb-streams-use-cases-and-design-patterns/
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
Will those removed 100 items be put in just one shard provided they all have the same partition key?
Assuming all 100 items have the same partition key value (but
different sort key values), they would have been stored on the same
partition. Therefore, they would be removed from the same partition
and be put in the same shard.
Since "records sent to your AWS Lambda function are strictly serialized", how does this serialisation work in the case of TTL? Is
order within a shard established by partition/sort keys, TTL
expiration, etc.?
DynamoDB Streams captures a time-ordered sequence of item-level
modifications in your DynamoDB table. This time-ordered sequence is
preserved at a per shard level. In other words, the order within a
shard is established based on the order in which items were created,
updated or deleted.
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
A DynamoDB stream consists of stream records, which are grouped into shards. A shard can spawn child shards in response to a high number of writes on the DynamoDB table, so you can have parent shards and possibly multiple child shards. To ensure that your application processes the records in the right sequence, a parent shard must always be processed before its child shards. This is described in detail in the docs.
Unfortunately, DynamoDB Streams records sent to AWS Lambda functions are strictly serialized per shard, and ordering of records across different shards is not guaranteed.
From the AWS Lambda FAQs:
Q: How does AWS Lambda process data from Amazon Kinesis streams and Amazon DynamoDB Streams?
The Amazon Kinesis and DynamoDB Streams records sent to your AWS
Lambda function are strictly serialized, per shard. This means that if
you put two records in the same shard, Lambda guarantees that your
Lambda function will be successfully invoked with the first record
before it is invoked with the second record. If the invocation for one
record times out, is throttled, or encounters any other error, Lambda
will retry until it succeeds (or the record reaches its 24-hour
expiration) before moving on to the next record. The ordering of
records across different shards is not guaranteed, and processing of
each shard happens in parallel.
If you use the DynamoDB Streams Kinesis Adapter, your application will process the shards and stream records in the correct order according to the DynamoDB documentation here. For more information on DynamoDB Streams Kinesis Adapter, see Using the DynamoDB Streams Kinesis Adapter to Process Stream Records.
So, using a DynamoDB Lambda trigger won't guarantee ordering across shards. Your other options include using the DynamoDB Streams Kinesis Adapter or the DynamoDB Streams low-level API, which is a lot more work.
Related
I have a Lambda with an event source pointed to a Kinesis stream consumer (with an arbitrary number of shards).
I would like to ensure that items in the stream with the same 'partition key' are processed by Lambda in sequence and not simultaneously. (This is being used as the object's identity, and I don't want multiple Lambdas performing logic on the same object simultaneously.)
For example, if the items in the stream have partition keys:
1,2,1,3,4,1,2,1
If we take the order of processing to be left to right, Lambda would process one item for each of the partition keys 1, 2, 3 and 4 concurrently. Then, when it has finished an item with a specific partition key, it can start processing another one with that key.
Is this achievable in some way, without the use of a distributed lock that would make inefficient use of Lambda?
Thanks
Items with the same 'partition key' will be processed by Lambda in sequence for a stream event source mapping.
Moreover, you can specify 'concurrent batches per shard' when creating the Lambda trigger:
If 'concurrent batches per shard' is 1 (the default), order is preserved for the whole shard.
If 'concurrent batches per shard' is between 2 and 10, order is preserved only for records with the same partition key within the shard.
You can read about concurrent batches (ParallelizationFactor) at https://docs.aws.amazon.com/lambda/latest/dg/with-kinesis.html
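As a sketch, this is how the setting could be expressed as arguments for boto3's create_event_source_mapping. The ARN and function name in the usage line are placeholders, and the batch size and starting position are arbitrary choices of mine:

```python
def kinesis_mapping_spec(stream_arn, function_name, concurrent_batches=1):
    """Build create_event_source_mapping arguments for a Kinesis trigger."""
    # 'Concurrent batches per shard' is ParallelizationFactor in the API
    # and is limited to the range 1..10.
    if not 1 <= concurrent_batches <= 10:
        raise ValueError("concurrent_batches must be between 1 and 10")
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",
        "BatchSize": 100,
        "ParallelizationFactor": concurrent_batches,
    }
```

For example: boto3.client("lambda").create_event_source_mapping(**kinesis_mapping_spec("arn:aws:kinesis:...", "my-function", 4)), where both arguments stand in for your own values.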
Seems like I was tackling the problem in the wrong way. Lambda guarantees that within a shard, the Lambda instance is invoked on one batch at a time. Therefore, there is no need for a distributed lock as at worst there will be multiple records belonging to the same entity in the same batch and processing them in order can be managed in-memory within the Lambda function itself.
Reference from the AWS FAQs http://aws.amazon.com/lambda/faqs/
Q: How does AWS Lambda process data from Amazon Kinesis streams and
Amazon DynamoDB Streams?
The Amazon Kinesis and DynamoDB Streams records sent to your AWS
Lambda function are strictly serialized, per shard. This means that if
you put two records in the same shard, Lambda guarantees that your
Lambda function will be successfully invoked with the first record
before it is invoked with the second record. If the invocation for one
record times out, is throttled, or encounters any other error, Lambda
will retry until it succeeds (or the record reaches its 24-hour
expiration) before moving on to the next record. The ordering of
records across different shards is not guaranteed, and processing of
each shard happens in parallel.
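A minimal sketch of the in-memory approach, assuming the standard Kinesis event shape that Lambda passes to the function. Records are grouped by partition key while keeping their arrival order inside each group, so each entity can then be handled sequentially:

```python
import base64
from collections import OrderedDict


def group_by_partition_key(event):
    """Group a Kinesis batch by partition key, preserving batch order."""
    groups = OrderedDict()
    for record in event["Records"]:
        kinesis = record["kinesis"]
        payload = base64.b64decode(kinesis["data"])  # data is base64-encoded
        groups.setdefault(kinesis["partitionKey"], []).append(payload)
    return groups
```

With the keys from the question (1,2,1,3,4,1,2,1) in one batch, this yields four groups, and the group for key 1 holds its four payloads in the order they arrived.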
We want to use a Kinesis stream and Firehose to update an AWS-managed Elasticsearch cluster. We have hundreds of different indexes (corresponding to our DB shards) that need to be updated. When creating the Firehose, I am required to specify the name of the index I want updated. Does that mean I need to create a separate Firehose for each index in my cluster? Or is there a way to configure the Firehose so it knows which index to use based on the content of the data?
Also, we would have 20 or so separate producers that would send data to a Kinesis stream (each one of these producers would generate data for 10 different indexes). Would I also need a separate Kinesis stream for each producer?
Summary:
20 producers (EC2 instances) -> Each producer sends data for 10 different indexes to a Kinesis stream -> The Kinesis stream then uses a Firehose to update a single cluster which has 200 indexes in it.
Note: all of the indexes have the same mapping and name template, i.e. index_1, index_2...index_200
Edit: As we reindex the data we create new indexes along the lines of index_1-v2. Obviously we won't want to create a new firehose for each index version as they're being created. The new index name can be included in the JSON that's sent to the kinesis stream.
As you guessed, Firehose is the wrong solution for this problem, at least as stated. It is designed for situations where there's a 1:1 correspondence between stream (not producer!) and index. Things like clickstream data or log aggregation.
For any solution, you'll need to provide a mechanism to identify which index a record belongs to. You could do this by creating a separate Kinesis stream per message type (in which case you could use Firehose), but this would mean that your producers have to decide which stream to write each message to. That may cause unwanted complexity in your producers, and may also increase your costs unacceptably.
So, assuming that you want a single stream for all messages, you need a consumer application and some way to group those messages. You could include a message type (or index name) in the record itself, or use the partition key for that purpose. The partition key makes for a somewhat easier implementation, as it guarantees that records for the same index will be stored on the same shard, but it means that your producers may be throttled.
For the consumer, you could use an always-on application that runs on EC2, or have the stream invoke a Lambda function.
Using Lambda is nice if you're using partition key to identify the message type, because each invocation only looks at a single shard (you may still have multiple partition keys in the invocation). On the downside, Lambda will poll the stream once per second, which may result in throttling if you have multiple stream consumers (with a stand-alone app you can control how often it polls the stream).
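A rough sketch of the Lambda variant, assuming that producers put the target index name in the partition key and send JSON documents as record data (both are assumptions about your setup). It turns one invocation's records into a newline-delimited body for the Elasticsearch _bulk API; actually sending the body to the cluster is left out:

```python
import base64
import json


def build_bulk_body(event):
    """Turn a Kinesis batch into an Elasticsearch _bulk request body."""
    lines = []
    for record in event["Records"]:
        kinesis = record["kinesis"]
        document = json.loads(base64.b64decode(kinesis["data"]))
        # Action line: the partition key is assumed to be the index name.
        lines.append(json.dumps({"index": {"_index": kinesis["partitionKey"]}}))
        lines.append(json.dumps(document))
    # The bulk API expects every line, including the last, to end in "\n".
    return "\n".join(lines) + "\n" if lines else ""
```

Because the index name travels with each record, new names such as index_1-v2 would need no change on the consumer side.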
Do triggers on DynamoDB tables have some sort of internal synchronization to keep everything in the order it is supposed to be?
Example: my trigger batch size is 1 and it's configured to always start reading from the latest entry. Two entries are made to the DB at one millisecond apart (or at the same time). I don't know the time it takes for the trigger and lambda function to be invoked but let's say for argument's sake it's longer than the time between DB entries (>1ms). Can I be sure that both lambda invocations don't receive the data from the second DB entry?
DynamoDB Streams doesn't send duplicates.
No, DynamoDB Streams is designed so that every update made to your
table will be represented exactly once in the stream.
DynamoDB Streams guarantees the following:
Each stream record appears exactly once in the stream. For each item
that is modified in a DynamoDB table, the stream records appear in the
same sequence as the actual modifications to the item.
DynamoDB Streams provides a time-ordered sequence of item-level changes made to data in a table.
Because there are a few milliseconds between updates 1 and 2, Lambda should receive the two stream records in time order (i.e. update 1 and then update 2).
Processing stream records in Lambda:
The Amazon Kinesis and DynamoDB Streams records sent to your AWS
Lambda function are strictly serialized, per shard. This means that if
you put two records in the same shard, Lambda guarantees that your
Lambda function will be successfully invoked with the first record
before it is invoked with the second record. If the invocation for one
record times out, is throttled, or encounters any other error, Lambda
will retry until it succeeds (or the record reaches its 24-hour
expiration) before moving on to the next record. The ordering of
records across different shards is not guaranteed, and processing of
each shard happens in parallel.
Stream-based event sources –
If you create a Lambda function that processes events from
stream-based services (Amazon Kinesis Streams or DynamoDB streams),
the number of shards per stream is the unit of concurrency. If your
stream has 100 active shards, there will be 100 Lambda functions
running concurrently. Then, each Lambda function processes events on a
shard in the order that they arrive.
Short answer:
Streams ensure that there are no duplicates, so there is no way that two
Lambda invocations receive the same data.
Regarding the processing of stream records: whether processing of the
second update starts only after processing of the first depends on the
number of shards in the stream (the unit of concurrency).
Because shards have a lineage (parent and children), applications must always process a parent shard before they process a child shard. This ensures that the stream records are also processed in the correct order. Use the DynamoDB Streams Kinesis Adapter if you want to preserve the correct processing order.
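If you read the stream through the low-level API yourself, the parent-before-child rule could be sketched like this. The input is assumed to be the Shards list from a DescribeStream response, where each entry has a ShardId and, for child shards, a ParentShardId:

```python
def order_shards_parent_first(shards):
    """Return shard IDs so that every parent precedes its children."""
    by_id = {shard["ShardId"]: shard for shard in shards}
    ordered = []
    seen = set()

    def visit(shard):
        if shard["ShardId"] in seen:
            return
        # A parent may be missing from the list if it has already expired;
        # in that case the child is treated as a root.
        parent = by_id.get(shard.get("ParentShardId"))
        if parent is not None:
            visit(parent)
        seen.add(shard["ShardId"])
        ordered.append(shard["ShardId"])

    for shard in shards:
        visit(shard)
    return ordered
```

The Kinesis Adapter takes care of this ordering for you; the sketch only shows the idea.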
Firehose->S3 uses the current date as a prefix for creating keys in S3. So this partitions the data by the time the record is written. My firehose stream contains events which have a specific event time.
Is there a way to create S3 keys containing this event time instead? Processing tools downstream depend on each event being in an "hour-folder" related to when it actually happened. Or would that have to be an additional processing step after Firehose is done?
The event time could be in the partition key or I could use a Lambda function to parse it from the record.
Kinesis Firehose doesn't (yet) allow clients to control how the date suffix of the final S3 objects is generated.
Your only option is to add a post-processing layer after Kinesis Firehose. For example, you could schedule an hourly EMR job, using Data Pipeline, that reads all files written in the last hour and publishes them to the correct S3 destinations.
This is not an answer to the question, but I would like to explain the idea behind storing records according to event arrival time.
First, a few words about streams. Kinesis is just a stream of data, and it has a concept of consuming. One can reliably consume a stream only by reading it sequentially. There is also the idea of checkpoints as a mechanism for pausing and resuming consumption. A checkpoint is just a sequence number that identifies a position in the stream. By specifying this number, one can start reading the stream from a certain event.
Now back to the default Firehose-to-S3 setup. Since the capacity of a Kinesis stream is quite limited, one most probably needs to store the data from Kinesis somewhere to analyze it later. The Firehose-to-S3 setup does this right out of the box: it just stores raw data from the stream in S3 buckets. But logically this data is still the same stream of records, and to be able to reliably consume (read) this stream one needs those sequential numbers for checkpoints. These numbers are the record arrival times.
What if I want to read records by creation time? It looks like the proper way to accomplish this is to read the S3 stream sequentially, dump it into some [time series] database or data warehouse, and do creation-time-based reads against that storage. Otherwise there will always be a non-zero chance of missing some batches of events while reading the S3 stream. So I would not suggest reordering the S3 buckets at all.
You'll need to do some post-processing or write a custom streaming consumer (such as Lambda) to do this.
We dealt with a huge event volume at my company, so writing a Lambda function didn't seem like a good use of money. Instead, we found batch-processing with Athena to be a really simple solution.
First, you stream into an Athena table, events, which can optionally be partitioned by an arrival-time.
Then, you define another Athena table, say, events_by_event_time which is partitioned by the event_time attribute on your event, or however it's been defined in the schema.
Finally, you schedule a process to run an Athena INSERT INTO query that takes events from events and automatically repartitions them to events_by_event_time and now your events are partitioned by event_time without requiring EMR, data pipelines, or any other infrastructure.
You can do this with any attribute on your events. It's also worth noting you can create a view that does a UNION of the two tables to query real-time and historic events.
I actually wrote more about this in a blog post here.
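A sketch of what the scheduled query could look like, built as a string in Python. The table names events and events_by_event_time are the ones used above; the columns payload, event_time (a timestamp), arrival_hour and event_hour are made-up names, so adjust them to your schema. Note that Athena expects the partition column last in the SELECT list:

```python
from datetime import datetime


def build_repartition_query(arrival_hour):
    """Build an INSERT INTO query that repartitions one arrival hour."""
    # Reject anything that is not a date-hour string before embedding it
    # in the SQL text (raises ValueError otherwise).
    datetime.strptime(arrival_hour, "%Y-%m-%d-%H")
    return (
        "INSERT INTO events_by_event_time\n"
        "SELECT payload, event_time,\n"
        "       date_format(event_time, '%Y-%m-%d-%H') AS event_hour\n"
        "FROM events\n"
        f"WHERE arrival_hour = '{arrival_hour}'"
    )
```

The scheduled process would then submit build_repartition_query("2021-09-01-13") (or whichever hour just closed) to Athena.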
For future readers - Firehose supports Custom Prefixes for Amazon S3 Objects
https://docs.aws.amazon.com/firehose/latest/dev/s3-prefixes.html
AWS started offering "Dynamic Partitioning" in Aug 2021:
Dynamic partitioning enables you to continuously partition streaming data in Kinesis Data Firehose by using keys within data (for example, customer_id or transaction_id) and then deliver the data grouped by these keys into corresponding Amazon Simple Storage Service (Amazon S3) prefixes.
https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html
Look at https://docs.aws.amazon.com/firehose/latest/dev/dynamic-partitioning.html. You can implement a Lambda function which takes your records, processes them, changes the partition key and then sends them back to Firehose to be added. You would also have to change the Firehose to enable this partitioning and define your custom partition key/prefix/suffix.
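A minimal sketch of such a transformation Lambda, assuming each record is a JSON object with an event_time field holding epoch seconds (the field name and format are assumptions about your data). It leaves the data untouched and returns partition keys derived from the event time:

```python
import base64
import json
from datetime import datetime, timezone


def handler(event, context=None):
    """Firehose transformation: derive partition keys from event time."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            ts = datetime.fromtimestamp(payload["event_time"], tz=timezone.utc)
        except (ValueError, KeyError, TypeError, OverflowError, OSError):
            # Unparseable records are reported as failed, not dropped.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": record["data"],
            "metadata": {"partitionKeys": {
                "year": ts.strftime("%Y"),
                "month": ts.strftime("%m"),
                "day": ts.strftime("%d"),
                "hour": ts.strftime("%H"),
            }},
        })
    return {"records": output}
```

The S3 prefix on the Firehose could then reference these keys, for example events/!{partitionKeyFromLambda:year}/!{partitionKeyFromLambda:month}/!{partitionKeyFromLambda:day}/!{partitionKeyFromLambda:hour}/, which gives the "hour-folder" layout asked for in the question.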
I went through this article, which says that data records are organized into groups called shards, and that these shards can be consumed and processed in parallel by Lambda functions.
I also found these slides from an AWS webinar, where on slide 22 you can also see that Lambda functions consume different shards in parallel.
However, I could not achieve parallel execution of a single function. I created a simple Lambda function that runs for a minute. Then I started to create tons of items in DynamoDB, expecting to get a lot of stream records. In spite of this, my function invocations were started one after another.
What am I doing wrong?
Pre-Context:
How does DynamoDB store data?
DynamoDB uses partitions to store table records. These partitions are abstracted from users and managed by the DynamoDB team. As data grows in the table, these partitions are further divided internally.
What are DynamoDB streams all about?
DynamoDB, as a database, provides a way for users to retrieve an ordered change log (think of it as the transactional replay log of a traditional database). These logs are vended as DynamoDB table streams.
How is data published in streams?
A stream has a concept of shards (somewhat similar to partitions). Shards, by definition, contain ordered events. In DynamoDB terminology, a stream shard contains the data from a certain partition.
Cool! So what happens if data grows in the table or frequent writes occur?
DynamoDB keeps persisting records, based on hash key/sort key, in the associated partition until a threshold is breached (such as table size and/or RCU/WCU counts). DynamoDB does not publish the exact values of these thresholds, though there is some documentation with rough estimates.
Once this threshold is breached, DynamoDB splits the partition and rehashes to distribute the data (somewhat) evenly across the partitions.
Since new partitions have been created, their data is published to their own shards (each mapped to its partition).
Great, so what about Lambda? How does the parallel processing work, then?
One Lambda function invocation processes records from one and only one shard. Thus the number of shards present in the DynamoDB stream determines the number of Lambda functions running in parallel.
Roughly, you can think of it as: # of partitions = # of shards = # of Lambda functions running in parallel.
From the first article it is said:
Because shards have a lineage (parent and children), applications must always process a parent shard before it processes a child shard. This will ensure that the stream records are also processed in the correct order.
Yet, when working with Kinesis streams for example, you can achieve parallelism by having multiple shards as the order in which records are processed is guaranteed only within a shard.
Side note: it makes some sense to trigger Lambda with DynamoDB events in order.