How to guarantee serial processing of a Kinesis event stream when using parallelization factor? - concurrency

The Kinesis stream has only 1 shard, and when creating the Lambda, the concurrent batches per shard for the Kinesis stream source was set to 10. When there is a spike in stream data, Lambda will increase the concurrency up to 10, meaning we will have 10 Lambda invocations working in parallel. My question in this case is: how can we guarantee that the event stream is processed serially? It seems to me that this is impossible, because we can't control the concurrency. Does anyone have an idea for this? I can't get my head round it.

AWS Lambda supports concurrent batch processing per shard and serial event processing, as long as all events in the Kinesis stream have the same partition key.
From AWS documentation:
You can also increase concurrency by processing multiple batches from each shard in parallel. Lambda can process up to 10 batches in each shard simultaneously. If you increase the number of concurrent batches per shard, Lambda still ensures in-order processing at the partition-key level.
References:
Using AWS Lambda with Amazon Kinesis (AWS)
Partition Key (Amazon Kinesis Data Streams Terminology and Concepts)
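The per-partition-key guarantee falls out of how Kinesis routes records: the partition key is MD5-hashed into a 128-bit number, and each shard owns a range of that hash space, so records sharing a key always land on the same shard. A minimal sketch of the routing, assuming evenly split shards (real shards can be split and merged into uneven ranges):

```python
import hashlib

def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Roughly how Kinesis routes a record: MD5-hash the partition key into a
    128-bit integer, then map it into one of `num_shards` equal hash ranges.
    Simplified sketch; real shard hash ranges come from splits/merges."""
    key_hash = int.from_bytes(hashlib.md5(partition_key.encode("utf-8")).digest(), "big")
    return key_hash * num_shards // 2 ** 128

# Same partition key, same shard - which is why per-key ordering survives
# even when Lambda runs multiple batches per shard concurrently.
assert shard_for_key("user-42", 4) == shard_for_key("user-42", 4)
```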

Related

Reprocessing messages AWS kinesis data stream message using KCL consumer

A KCL consumer runs on EC2 machines in an Auto Scaling Group (ASG) sized according to the number of provisioned shards of the Kinesis data stream, which means that if the stream has n provisioned shards, then at most n EC2 machines can be configured to consume messages, one per shard, as per this document link
Now, messages will be processed in real time as soon as they arrive in the Kinesis data stream, because the shard iterator type is set to LATEST for the KCL consumer. For more info check here.
A DynamoDB table is configured for the KCL consumer, holding a checkpoint entry for each provisioned shard to keep track of which shards in the Kinesis data stream are being leased and processed by the workers of the KCL consumer application.
We want to reprocess every message present in the Kinesis data stream within its data retention period (which can be up to 7 days). Is there a simple and easy mechanism to do this?
Possible theoretical solution (can be incorrect or improved):
First Approach
Stop KCL consumer workers.
Delete the DynamoDB lease table (which holds a checkpoint entry for each provisioned shard) so that workers start picking up messages from the Kinesis data stream again.
Restart the KCL consumer service.
Second Approach
Stop the KCL consumer
Edit/update the checkpoint value for each shard to point to a previous/old position. Is there a conversion formula? I don't know. Alternatively, could we write some other dummy value that the KCL consumer would then overwrite?
Restart KCL consumer service
Any other approach?
Please feel free to suggest or comment on how we can reprocess Kinesis data stream messages effectively and without problems.
To reprocess all the stream data with your first approach, you would need to change the iterator type from LATEST to TRIM_HORIZON before deleting the tables and restarting the KCL consumer; otherwise you would only process new arrivals to the stream.
The second approach is also possible: you will need to get the shard iterator for each shard, again using the TRIM_HORIZON shard iterator type. There is also the possibility to indicate a timestamp, in case you need to reprocess less data than the full retention of your stream. This AWS reference documentation can be useful.
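As a sketch of the second approach, the GetShardIterator request parameters could be built like this. The stream and shard names are placeholders; with boto3 you would pass the resulting dict to `kinesis.get_shard_iterator`:

```python
from datetime import datetime, timezone

def shard_iterator_request(stream_name, shard_id, timestamp=None):
    """Build GetShardIterator parameters for reprocessing.
    TRIM_HORIZON replays everything still inside the retention period;
    AT_TIMESTAMP replays only records written at or after `timestamp`."""
    params = {"StreamName": stream_name, "ShardId": shard_id}
    if timestamp is None:
        params["ShardIteratorType"] = "TRIM_HORIZON"
    else:
        params["ShardIteratorType"] = "AT_TIMESTAMP"
        params["Timestamp"] = timestamp
    return params

# With boto3 this would be used as (placeholder names):
# iterator = boto3.client("kinesis").get_shard_iterator(
#     **shard_iterator_request("my-stream", "shardId-000000000000"))
```

Repeat per shard (list them via `list_shards`), then page through `get_records` with the returned iterator.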

Kinesis Batch size and Lambda

I am confused about Kinesis batch size and Lambda.
Suppose our batch size is 100 and Kinesis receives 200 notifications. Will that trigger 2 Lambda threads?
It does not work like this by default. By default:
Lambda invokes a function with one batch of data records from one shard at a time.
To process multiple batches from the same shard at the same time you have to configure:
Concurrent batches per shard – Process multiple batches from the same shard concurrently.
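The arithmetic for the numbers above can be sketched as follows (a simplification: actual batch boundaries also depend on the batch window and record arrival timing):

```python
import math

def invocations_per_shard(records: int, batch_size: int) -> int:
    """Number of batches Lambda will work through for one shard."""
    return math.ceil(records / batch_size)

def max_concurrency(shards: int, concurrent_batches_per_shard: int = 1) -> int:
    """Upper bound on simultaneous invocations for the event source."""
    return shards * concurrent_batches_per_shard

# 200 records, batch size 100, 1 shard: two batches, but with the default
# of 1 concurrent batch per shard they are invoked one after the other.
assert invocations_per_shard(200, 100) == 2
assert max_concurrency(shards=1) == 1
assert max_concurrency(shards=1, concurrent_batches_per_shard=10) == 10
```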

Multiple destinations for a Kinesis stream at the same time

Can we have multiple destinations from single Kinesis Streams?
I am getting output in Splunk but now I also want to add an S3 bucket as the destination.
If I add another Amazon Kinesis Data Firehose, will it affect the performance of the Splunk reader? Splunk pulls directly from Kinesis. Will adding another destination affect our current reads and writes?
One of the benefits of using Kinesis is that it supports exactly this behaviour.
Each consumer application is responsible for tracking which events it has read from the shard. There is no concept of an entry being "already processed" shared between 2 separate applications.
One recommendation from AWS to bear in mind for high throughput with multiple consumers is to use enhanced fan-out.
Each consumer registered to use enhanced fan-out receives its own read throughput per shard, up to 2 MB/sec, independently of other consumers.
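A rough way to see the difference: the 2 MB/s per-shard read limit comes from the Kinesis quotas, and this sketch assumes shared-throughput consumers poll at comparable rates:

```python
def per_consumer_throughput_mbps(num_consumers: int, enhanced_fan_out: bool) -> float:
    """Shared-throughput (GetRecords) consumers split a shard's 2 MB/s read
    limit between them, while each enhanced fan-out consumer gets its own
    dedicated 2 MB/s pipe per shard."""
    SHARD_READ_LIMIT_MBPS = 2.0
    if enhanced_fan_out:
        return SHARD_READ_LIMIT_MBPS
    return SHARD_READ_LIMIT_MBPS / num_consumers

# Splunk plus a Firehose-to-S3 consumer on the same shard:
assert per_consumer_throughput_mbps(2, enhanced_fan_out=False) == 1.0
assert per_consumer_throughput_mbps(2, enhanced_fan_out=True) == 2.0
```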

Does AWS Lambda process DynamoDB stream events strictly in order?

I'm in the process of writing a Lambda function that processes items from a DynamoDB stream.
I thought part of the point behind Lambda was that if I have a large burst of events, it'll spin up enough instances to get through them concurrently, rather than feeding them sequentially through a single instance. As long as two events have a different key, I am fine with them being processed out of order.
However, I just read this page on Understanding Retry Behavior, which says:
For stream-based event sources (Amazon Kinesis Data Streams and DynamoDB streams), AWS Lambda polls your stream and invokes your Lambda function. Therefore, if a Lambda function fails, AWS Lambda attempts to process the erring batch of records until the time the data expires, which can be up to seven days for Amazon Kinesis Data Streams. The exception is treated as blocking, and AWS Lambda will not read any new records from the stream until the failed batch of records either expires or processed successfully. This ensures that AWS Lambda processes the stream events in order.
Does "AWS Lambda processes the stream events in order" mean Lambda cannot process multiple events concurrently? Is there any way to have it process events from distinct keys concurrently?
With AWS Lambda Supports Parallelization Factor for Kinesis and DynamoDB Event Sources, the order is still guaranteed for each partition key, but not necessarily within each shard when Concurrent batches per shard is set to be greater than 1. Therefore the accepted answer needs to be revised.
Stream records are organized into groups, or shards.
According to the Lambda documentation, concurrency is achieved at the shard level. Within each shard, the stream events are processed in order.
Stream-based event sources : for Lambda functions that process Kinesis
or DynamoDB streams the number of shards is the unit of concurrency.
If your stream has 100 active shards, there will be at most 100 Lambda
function invocations running concurrently. This is because Lambda
processes each shard’s events in sequence.
And according to Limits in DynamoDB,
Do not allow more than two processes to read from the same DynamoDB
Streams shard at the same time. Exceeding this limit can result in
request throttling.
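Raising the parallelization factor is the managed way to get per-key concurrency, but if you want distinct keys processed concurrently inside a single invocation, one possible sketch is to group the batch by key and fan the groups out to threads. `records` here is a simplified stand-in for the real stream record shape:

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def process_batch(records, handle_record):
    """Keep per-key order while letting distinct keys run concurrently.

    `records` is assumed to be a list of dicts with a "key" field -- a
    simplified, hypothetical stand-in for DynamoDB stream records.
    """
    groups = defaultdict(list)
    for record in records:          # batch order == per-shard arrival order
        groups[record["key"]].append(record)

    def run_group(group):
        for record in group:        # strictly in order within one key
            handle_record(record)

    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_group, g) for g in groups.values()]
        for f in futures:
            f.result()              # re-raise any per-key failure
```

Note that on failure Lambda retries the whole batch, so `handle_record` should stay idempotent.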

AWS Lambda Limits when processing Kinesis Stream

Can someone explain what happens to events when a Lambda is subscribed to Kinesis item-create events? There is a limit of 100 concurrent requests for an account in AWS, so if 1,000,000 items are added to Kinesis, how are the events handled? Are they queued up for the next available concurrent Lambda?
From the FAQ http://aws.amazon.com/lambda/faqs/
"Q: How does AWS Lambda process data from Amazon Kinesis streams and Amazon DynamoDB Streams?
The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel."
What this means is if you have 1M items added to Kinesis, but only one shard, the throttle doesn't matter - you will only have one Lambda function instance reading off that shard in serial, based on the batch size you specified. The more shards you have, the more concurrent invocations your function will see. If you have a stream with > 100 shards, the account limit you mention can be easily increased to whatever you need it to be through AWS customer support. More details here. http://docs.aws.amazon.com/lambda/latest/dg/limits.html
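Putting the answer's point in one line of arithmetic (`parallelization_factor` is the later-added concurrent-batches-per-shard setting; the default limit value here is illustrative):

```python
def effective_concurrency(shards, parallelization_factor=1, account_limit=1000):
    """How many invocations can run at once: one per shard (times the
    parallelization factor), capped by the account concurrency limit.
    Excess batches simply wait their turn; records queue on the shard."""
    return min(shards * parallelization_factor, account_limit)

# 1M records on a single shard: still one invocation at a time.
assert effective_concurrency(shards=1) == 1
# 150 shards against a 100-invocation account limit: capped at 100.
assert effective_concurrency(shards=150, account_limit=100) == 100
```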
Hope that helps!