Reprocessing AWS Kinesis data stream messages using a KCL consumer - amazon-web-services

A KCL consumer runs on EC2 instances in an Auto Scaling Group (ASG) sized according to the number of provisioned shards of the Kinesis data stream. This means that if the stream has n provisioned shards, at most n EC2 instances can be configured to consume messages, one per shard, as described in this document.
Messages are processed in real time as soon as they arrive in the Kinesis data stream, because the shard iterator type for the KCL consumer is set to LATEST. For more info, check here.
A DynamoDB table is configured for the KCL consumer, with a checkpoint entry per provisioned shard, to keep track of which shards of the Kinesis data stream are being leased and processed by the workers of the KCL consumer application.
Suppose we want to reprocess every message still present in the Kinesis data stream, i.e. everything within its data retention period (24 hours by default, extendable up to 7 days or more). Is there any simple and easy mechanism to do it?
A possible theoretical solution (which may be incorrect or open to improvement):
First Approach
Stop the KCL consumer workers.
Delete the DynamoDB table that holds the checkpoint entries for the provisioned shards, so that the workers start picking up messages from the Kinesis data stream afresh.
Restart the KCL consumer service.
Second Approach
Stop the KCL consumer.
Edit/update the checkpoint value for each shard to one corresponding to a previous/old timestamp. Is there a conversion formula? I don't know. Alternatively, could we write some dummy value that would simply be overwritten by the KCL consumer?
Restart the KCL consumer service.
Any other approach?
Feel free to suggest or comment on how we can reprocess Kinesis data stream messages effectively and without problems.

To reprocess all the stream data with your first approach, you would need to change the iterator type from LATEST to TRIM_HORIZON before deleting the tables and restarting the KCL consumer; otherwise you would only process new arrivals to the stream.
The second approach is also possible: you will need to get a shard iterator for each of the shards, again using the TRIM_HORIZON shard iterator type. There is also the possibility to indicate a timestamp (AT_TIMESTAMP) in case you need to reprocess less data than the full retention period of your stream. This AWS reference documentation can be useful.
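For illustration, here is a minimal boto3 (Python) sketch of the second approach: enumerate the shards and request an iterator per shard, either TRIM_HORIZON for the full retention window or AT_TIMESTAMP to start at a chosen point in time. The stream name and timestamp are placeholders, and a real reprocessor would page through list_shards and loop on NextShardIterator.

```python
import datetime

import boto3

kinesis = boto3.client("kinesis")
STREAM = "my-stream"  # placeholder stream name

# Enumerate the shards (a real implementation would follow NextToken).
shards = kinesis.list_shards(StreamName=STREAM)["Shards"]

for shard in shards:
    # TRIM_HORIZON starts at the oldest record still within retention;
    # AT_TIMESTAMP starts at a chosen point inside the retention window.
    iterator = kinesis.get_shard_iterator(
        StreamName=STREAM,
        ShardId=shard["ShardId"],
        ShardIteratorType="AT_TIMESTAMP",
        Timestamp=datetime.datetime(2021, 1, 1),  # placeholder timestamp
    )["ShardIterator"]

    # Read one batch; a real consumer keeps looping on NextShardIterator.
    records = kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]
    print(shard["ShardId"], len(records))
```

With the KCL itself, the equivalent is configuring the consumer's initial position in the stream as TRIM_HORIZON (or a timestamp) and removing the lease table, as described above.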

Related

How to guarantee that a Kinesis event stream is processed serially when using a parallelization factor?

The Kinesis stream has only 1 shard, and when creating the Lambda, the number of concurrent batches per shard for the Kinesis stream source was set to 10. When there is a spike in stream data, Lambda will increase the concurrency up to 10, meaning we will have 10 Lambdas working in parallel. My question in this case is: how can we guarantee that the event stream is processed serially? It seems to me that this is impossible, because we can't control the concurrency. Does anyone have an idea? I can't get my head around it.
AWS Lambda supports concurrent batch processing per shard and still guarantees serial event processing, as long as all the events in the Kinesis stream share the same partition key.
From AWS documentation:
You can also increase concurrency by processing multiple batches from each shard in parallel. Lambda can process up to 10 batches in each shard simultaneously. If you increase the number of concurrent batches per shard, Lambda still ensures in-order processing at the partition-key level.
References:
Using AWS Lambda with Amazon Kinesis (AWS)
Partition Key (Amazon Kinesis Data Streams Terminology and Concepts)
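As a hedged illustration (not part of the original question), this boto3 sketch sets the concurrent-batches setting via ParallelizationFactor on the event source mapping; the stream ARN and function name are placeholders:

```python
import boto3

lambda_client = boto3.client("lambda")

# Map a Kinesis stream to a Lambda function with up to 10 concurrent
# batches per shard; ordering is still guaranteed per partition key.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/my-stream",
    FunctionName="my-consumer-fn",  # placeholder function name
    StartingPosition="LATEST",
    BatchSize=100,
    ParallelizationFactor=10,
)
```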

How to limit the number of unprocessed records for AWS Kinesis?

For example, RabbitMQ has a way of setting queue limits: if the limit is reached, new messages from publishers are rejected, which applies a kind of backpressure from the consumers back to the producers (since messages sitting in queues have not been processed by consumers yet).
Is there a way to ensure this kind of behaviour for brokers like Kinesis, where the consumers pull messages rather than the broker pushing to them as RabbitMQ does?
In the case of Kinesis, similar to Kafka, the consumer state (the offset of consumption and so on) is kept in a separate entity, DynamoDB for Kinesis, so I know it can be trickier to get something like an unprocessed-records limit out of the box.
Does anyone know if there are settings you can use for this, maybe through the KCL/KPL client libraries, or something else?
No, unfortunately AWS Kinesis does not provide the feature you want. There is no way to stop a producer from writing into a Kinesis stream when the consumer cannot keep up.
In fact this is one of the advantages of using Kinesis: it buffers data up to the configured retention time with no extra work on your side. The only time it applies backpressure is when the producer writes too much data too fast and hits the Amazon Kinesis API limits: https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html
If you want a size-limited "queue", you may want to look into AWS SQS, which does cap the number of inflight messages (120,000 for standard queues).
If you do want to use Kinesis, you might build a custom solution that feeds the consumer delay back to the producer. For example, implement custom logic in the producer that monitors the consumer delay ('MillisBehindLatest') through Amazon CloudWatch (see https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-kcl.html) and stops writing when the consumer falls behind.
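A rough sketch of that idea, assuming the producer polls CloudWatch for the KCL-published MillisBehindLatest metric (emitted under a namespace matching the application name) and backs off while the consumer lags; the namespace, threshold, and missing dimensions are illustrative and would need adjusting to your setup:

```python
import datetime
import time

import boto3

cloudwatch = boto3.client("cloudwatch")
kinesis = boto3.client("kinesis")

MAX_LAG_MS = 60_000  # placeholder threshold: one minute behind


def consumer_is_behind(app_namespace: str) -> bool:
    """Check the KCL-published MillisBehindLatest over the last 5 minutes."""
    now = datetime.datetime.utcnow()
    resp = cloudwatch.get_metric_statistics(
        Namespace=app_namespace,  # the KCL application name by default
        MetricName="MillisBehindLatest",
        StartTime=now - datetime.timedelta(minutes=5),
        EndTime=now,
        Period=60,
        Statistics=["Maximum"],
    )
    points = resp["Datapoints"]
    return bool(points) and max(p["Maximum"] for p in points) > MAX_LAG_MS


# Producer side: pause writes while the consumer is behind.
while consumer_is_behind("my-kcl-app"):  # placeholder application name
    time.sleep(5)
kinesis.put_record(StreamName="my-stream", Data=b"payload", PartitionKey="k")
```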

Is it good or bad to create a Kinesis stream on demand for cost optimisation?

Here is the scenario:
I have X producers which generate millions of scheduled notifications (mail, SMS and push notifications) at different times of the day and night.
I'm using an AWS Kinesis stream to collect all the notification entries, and a trigger-based Lambda function (mapped to the Kinesis stream) to process them.
Problem:
The Kinesis stream needs to be kept alive all the time, even when nothing is flowing through it.
Possible solution:
Create the Kinesis stream on demand.
For example, the producer predicts the load size and, according to some predefined algorithm, creates a Kinesis stream with X shards.
Once the job is done, it cleans the stream up.
The question is: does the above approach look good and cost-effective?
Are there any technical challenges or blockers?
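To make the proposal concrete, here is a minimal boto3 sketch of the create-use-delete cycle described above, with placeholder names and a deliberately naive shard-sizing formula; note that the Lambda event source mapping would also have to be recreated for each new stream, which is one of the practical challenges:

```python
import boto3

kinesis = boto3.client("kinesis")


def run_scheduled_batch(stream_name: str, predicted_records_per_sec: int) -> None:
    # Each shard accepts up to 1,000 records/s or 1 MB/s on the write side,
    # so size the stream from the predicted load (naive placeholder formula).
    shard_count = max(1, predicted_records_per_sec // 1000)

    kinesis.create_stream(StreamName=stream_name, ShardCount=shard_count)
    kinesis.get_waiter("stream_exists").wait(StreamName=stream_name)
    try:
        # ... recreate the Lambda event source mapping, produce the
        # notifications, and wait until the consumer has drained them ...
        pass
    finally:
        kinesis.delete_stream(StreamName=stream_name)
```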

Kafka-like offset on a Kinesis stream?

I have worked a bit with Kafka in the past, and lately there is a requirement to port part of our data pipeline to an AWS Kinesis stream. Now, I have read that Kinesis is closely modelled on Kafka and shares many similarities.
However, I have failed to see how we can have multiple consumers reading from the same stream, each with its own offset. Every data record has a sequence number, but I couldn't find anything specific to a consumer (the equivalent of a Kafka group id?).
Is it really possible to have different consumers, each with a different ingestion rate, over the same AWS Kinesis stream?
Yes.
You can have multiple Kinesis consumer applications. Let's say you have 2.
The first consumer application (I think this is a "consumer group" in Kafka?) can be "first-app" and store its positions in the DynamoDB table "first-app-table". It can have as many nodes (EC2 instances) as you want.
The second consumer application can work on the same stream and store its positions in another DynamoDB table, say "second-app-table".
Each table contains the information "what is the last processed position on shard X for app Y". So the two applications store checkpoints for the same shards in different places, which makes them independent.
About the ingestion rate: consumer applications using the KCL have an "idleTimeBetweenReadsInMillis" value, which is the polling interval for Amazon Kinesis Get operations. For example, the first application can have a poll interval of "2000", so it will poll the stream's shards every 2 seconds to see whether any new records have arrived.
I don't know Kafka well, but as far as I remember: a Kafka "partition" is a "shard" in Kinesis, and likewise a Kafka "offset" is a "sequence number" in Kinesis. The Kinesis Client Library uses the term "checkpoint" for the stored sequence numbers. Like you said, the concepts are similar.
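As a simplified illustration of that independence (not the exact KCL lease schema), each application could resume from its own checkpoint table with an AFTER_SEQUENCE_NUMBER iterator; the table layout and key names here are made up:

```python
import boto3

kinesis = boto3.client("kinesis")
dynamodb = boto3.resource("dynamodb")


def resume_iterator(app_name: str, stream: str, shard_id: str) -> str:
    """Return a shard iterator that continues from this app's own checkpoint."""
    table = dynamodb.Table(f"{app_name}-table")  # e.g. "first-app-table"
    item = table.get_item(Key={"shardId": shard_id}).get("Item")
    if item:
        # Continue right after the last sequence number this app processed.
        resp = kinesis.get_shard_iterator(
            StreamName=stream,
            ShardId=shard_id,
            ShardIteratorType="AFTER_SEQUENCE_NUMBER",
            StartingSequenceNumber=item["checkpoint"],
        )
    else:
        # No checkpoint yet: start from the oldest retained record.
        resp = kinesis.get_shard_iterator(
            StreamName=stream,
            ShardId=shard_id,
            ShardIteratorType="TRIM_HORIZON",
        )
    return resp["ShardIterator"]
```

Two applications calling this with different app_name values never touch each other's table, which is exactly why they can read the same stream at different rates.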

AWS Lambda Limits when processing Kinesis Stream

Can someone explain what happens to events when a Lambda function is subscribed to Kinesis record-creation events? There is a limit of 100 concurrent executions per account in AWS, so if 1,000,000 items are added to Kinesis, how are the events handled? Are they queued up for the next available concurrent Lambda?
From the FAQ: http://aws.amazon.com/lambda/faqs/
"Q: How does AWS Lambda process data from Amazon Kinesis streams and Amazon DynamoDB Streams?
The Amazon Kinesis and DynamoDB Streams records sent to your AWS Lambda function are strictly serialized, per shard. This means that if you put two records in the same shard, Lambda guarantees that your Lambda function will be successfully invoked with the first record before it is invoked with the second record. If the invocation for one record times out, is throttled, or encounters any other error, Lambda will retry until it succeeds (or the record reaches its 24-hour expiration) before moving on to the next record. The ordering of records across different shards is not guaranteed, and processing of each shard happens in parallel."
What this means is that if you have 1M items added to Kinesis but only one shard, the throttle doesn't matter: you will only ever have one Lambda function instance reading off that shard, serially, in batches of the size you specified. The more shards you have, the more concurrent invocations your function will see. If you have a stream with more than 100 shards, the account limit you mention can easily be increased to whatever you need through AWS customer support. More details here: http://docs.aws.amazon.com/lambda/latest/dg/limits.html
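A back-of-the-envelope sketch of that point, with hypothetical numbers:

```python
# 1M records on a single-shard stream are drained serially, one batch
# at a time, so the account concurrency limit is never the bottleneck.
records = 1_000_000
shards = 1
batch_size = 100  # configured on the event source mapping

invocations = records // batch_size  # 10,000 invocations, run one after another
peak_concurrency = shards            # at most one in-flight invocation per shard

print(invocations, peak_concurrency)  # 10000 1
```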
Hope that helps!