Is KCL for AWS kinesis processing thread safe? - amazon-web-services

We have an application that processes data from Kinesis and maintains some state for a few seconds. We are concerned that this state could be affected by the multithreaded nature of the KCL.
Can anybody tell us whether the RecordProcessor from the KCL is thread safe?

KCL is a wrapper library around your custom logic that processes your records.
The purpose of the library is to manage the Kinesis side of things while you focus on the record-processing logic. KCL assigns your EC2 workers to one or more shards (usually one worker per shard) and maintains a DynamoDB table that stores the checkpoint sequence numbers.
Your custom application logic is responsible for maintaining state and thread-safety.
By default, a list of Kinesis records (the target batch size is defined by you) fetched from your shard is passed to your processing code. You can process them sequentially or fork them to threads if you wish. KCL will not request more records from the shard on your behalf until you return from this processing method.
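For example, the batch handed to your processor can be fanned out to worker threads, as long as any shared state your code touches is made thread safe by you. This is a plain-Python sketch of the pattern, not the KCL API; names like `process_records` are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def process_record(record):
    # Stand-in for your business logic.
    return record.upper()

def process_records(batch):
    # Fan the batch out to threads; thread safety of any shared state
    # is your application's responsibility, not KCL's.
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(process_record, batch))
    # Only checkpoint once the whole batch is done; returning from this
    # method is what lets KCL fetch the next batch from the shard.
    return results

print(process_records(["a", "b", "c"]))
```

Note that `pool.map` preserves the input order of the batch even though the records are processed concurrently.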

Related

Reprocessing messages AWS kinesis data stream message using KCL consumer

A KCL consumer runs on EC2 machines in an Auto Scaling Group (ASG) sized according to the number of provisioned shards of the Kinesis data stream. This means that if the stream has n provisioned shards, then at most n EC2 machines can be configured to consume messages, one per shard, as per this document link
Messages will be processed in real time as soon as they arrive in the Kinesis data stream, because the shard iterator type is set to LATEST for the KCL consumer. For more info check here.
A DynamoDB table is configured for the KCL consumer, with a checkpoint entry for each provisioned shard, to keep track of which shards of the Kinesis data stream are being leased and processed by the workers of the KCL consumer application.
We want to reprocess every message present in the Kinesis data stream within its data retention period (24 hours by default, extendable up to 7 days or longer). Is there any simple and easy mechanism to do it?
Possible theoretical solution (can be incorrect or improved):
First Approach
Stop KCL consumer workers.
Delete the DynamoDB lease table (which holds a checkpoint entry per provisioned shard) so that workers start picking up messages from the beginning of the Kinesis data stream.
Restart the KCL consumer service.
Second Approach
Stop the KCL consumer
Edit/update the checkpoint value for each shard to an older timestamp. Is there a conversion formula? I don't know. Alternatively, could we write some other dummy value that the KCL consumer will overwrite?
Restart KCL consumer service
Any other approach?
Kindly feel free to suggest how we can reprocess Kinesis data stream messages effectively and without problems.
To reprocess all the stream data with your first approach, you would need to change the iterator type from LATEST to TRIM_HORIZON before deleting the tables and restarting the KCL consumer; otherwise you would only process new arrivals to the stream.
The second approach is also possible: you will need to get the shard iterator for all the shards, again using the TRIM_HORIZON shard iterator type. There is also the possibility of indicating a timestamp (AT_TIMESTAMP) in case you need to reprocess less data than the full retention of your stream. This AWS reference documentation can be useful.
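If you go the timestamp route, the only real computation is deriving the AT_TIMESTAMP starting point and clamping it to the stream's retention. A small sketch; `retention_hours=168` assumes a 7-day retention, so adjust it to your stream's actual setting:

```python
from datetime import datetime, timedelta, timezone

def clamp_replay_hours(hours_back, retention_hours=168):
    # You can only replay as far back as the stream retains data.
    return min(hours_back, retention_hours)

def at_timestamp(hours_back, retention_hours=168):
    # Value to pass as the AT_TIMESTAMP shard-iterator timestamp.
    hours = clamp_replay_hours(hours_back, retention_hours)
    return datetime.now(timezone.utc) - timedelta(hours=hours)
```

The resulting datetime is what you would supply when requesting an AT_TIMESTAMP shard iterator for each shard.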

Multiple nodes for message processing with Kafka

We have a Spring Boot app deployed on Kubernetes that processes messages: it reads from a Kafka topic, performs some mappings, and finally writes to Kafka topics.
To achieve higher throughput, we need to process the messages faster, and hence we introduce multiple nodes of this Spring Boot app.
But I believe this would lead to problems, because:
The messages should be processed in order
The messages contain state
Is there any solution to keep the messages in order, to guarantee that a message already processed by one node won't be processed by another, and to resolve any other issues caused by processing on multiple nodes?
Please feel free to address all possible solutions, because we are building a POC.
Would Apache Flink or Spring Cloud Stream be helpful for this?
When consuming messages from Kafka, it is important to keep the concept of a consumer group in mind. It ensures that nodes reading from a Kafka topic and sharing the same consumer group will not interfere with each other: whatever has been read by one consumer within the group will not be read again by another consumer of the same group.
In addition, applications reading and writing to Kafka scale with the number of partitions in a Kafka topic.
Adding multiple nodes would have no impact on a topic with only one partition, as one partition can only be read by a single consumer within a consumer group. You will find more information in the Kafka documentation on consumers.
When you have a topic with more than one partition, ordering might become an issue. Kafka only guarantees the order within a partition.
Here is an excerpt of the Kafka documentation describing the interaction between consumer group and partitions:
By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.
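The assignment the excerpt describes can be simulated in a few lines. This is a simplified round-robin assignor, not Kafka's actual implementation (the real range and round-robin assignors live in the client), but the invariant it demonstrates is the same: each partition is owned by exactly one consumer, and surplus consumers sit idle:

```python
def assign_partitions(num_partitions, consumers):
    # Each partition goes to exactly one consumer in the group;
    # extra consumers beyond the partition count receive nothing.
    assignment = {c: [] for c in consumers}
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

# 3 partitions, 4 consumers: the 4th consumer sits idle.
print(assign_partitions(3, ["c0", "c1", "c2", "c3"]))
```

This is why adding app nodes beyond the partition count buys no extra parallelism: size the topic's partition count for the peak parallelism you want.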
The limit to scaling up with Flink will be the number of partitions in your Kafka topic -- in other words, each instance of Flink's Kafka consumer will connect to and read from one or more partitions. With Flink, the ordering will be preserved unless you re-partition the data. Flink does provide exactly-once guarantees.
A quick way to experience Flink and Kafka in action together is to explore Flink's operations playground. This dockerized playground is set up to let you explore rescaling, failure recovery, etc., and should make all this much more concrete.
You can run several consumer threads in a single application, or even run several applications each with several consumer threads. When all consumers belong to the same group and the Kafka topic has enough partitions, Kafka will balance the partitions among them.
Messages in one partition are always ordered, but to keep the order by message key you should set max.in.flight.requests.per.connection=1 on the producer, so that retries cannot reorder writes. The broker always writes messages with the same key to the same partition (unless you change the number of partitions), so all messages with the same key stay ordered.
One partition is read by only one consumer, so the only way another consumer can receive already-processed messages is a partition rebalance before the message has been acknowledged. You can set ack-mode=MANUAL_IMMEDIATE and acknowledge a message immediately after processing, or use another acknowledgement mode.
I'd recommend reading this article: https://medium.com/@felipedutratine/kafka-ordering-guarantees-99320db8f87f
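The key-to-partition mapping mentioned above is just deterministic hashing. Kafka's default partitioner uses murmur2; the crc32 stand-in below is only an illustration of the property that matters for ordering, namely that the same key always lands on the same partition:

```python
import zlib

def partition_for_key(key, num_partitions):
    # Simplified stand-in for Kafka's default partitioner (which uses
    # murmur2): hash the key, take it modulo the partition count.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p1 = partition_for_key("order-42", 6)
p2 = partition_for_key("order-42", 6)
# Same key -> same partition, so per-key ordering is preserved.
```

Note that changing `num_partitions` changes the mapping, which is why adding partitions to a live topic breaks per-key ordering guarantees.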

SQS or Kinesis: which one is good for queuing?

I have a server that can only process 20 requests at a time. When lots of requests come in, I want to store the request data in a queue, then read a set of requests (i.e., 20) and process them as a batch. What would be the ideal way to do that: SQS or Kinesis? I'm totally confused.
SQS (Simple Queue Service) is for queuing messages 1:1: once a message is consumed, it is removed from the queue.
Kinesis is for low-latency, high-volume data streaming, typically 1:N (many consumers of the same messages).
Because Kinesis also stores the data for a period of time, the two are often confused, but their architectural patterns are totally different.
Queue => SQS.
Data streams => Kinesis.
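For the asker's concrete scenario (a server that handles at most 20 requests at once), the consumer side is just batch draining; note that a single SQS ReceiveMessage call returns at most 10 messages, so a batch of 20 would take two calls in practice. A local sketch of the batching logic:

```python
from collections import deque

def drain_in_batches(queue, batch_size=20):
    # Pull up to batch_size items at a time until the queue is empty.
    batches = []
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        batches.append(batch)
    return batches

work = deque(range(45))
print([len(b) for b in drain_in_batches(work)])  # three batches: 20, 20, 5
```

With SQS you would delete each message only after its batch is successfully processed, so failures are redelivered after the visibility timeout.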
Taken from https://aws.amazon.com/kinesis/data-streams/faqs/ :
Q: How does Amazon Kinesis Data Streams differ from Amazon SQS?
Amazon Kinesis Data Streams enables real-time processing of streaming big data. It provides ordering of records, as well as the ability to read and/or replay records in the same order to multiple Amazon Kinesis Applications. The Amazon Kinesis Client Library (KCL) delivers all records for a given partition key to the same record processor, making it easier to build multiple applications reading from the same Amazon Kinesis data stream (for example, to perform counting, aggregation, and filtering).
Amazon Simple Queue Service (Amazon SQS) offers a reliable, highly scalable hosted queue for storing messages as they travel between computers. Amazon SQS lets you easily move data between distributed application components and helps you build applications in which messages are processed independently (with message-level ack/fail semantics), such as automated workflows.
Q: When should I use Amazon Kinesis Data Streams, and when should I use Amazon SQS?
We recommend Amazon Kinesis Data Streams for use cases with requirements that are similar to the following:
Routing related records to the same record processor (as in streaming MapReduce). For example, counting and aggregation are simpler when all records for a given key are routed to the same record processor.
Ordering of records. For example, you want to transfer log data from the application host to the processing/archival host while maintaining the order of log statements.
Ability for multiple applications to consume the same stream concurrently. For example, you have one application that updates a real-time dashboard and another that archives data to Amazon Redshift. You want both applications to consume data from the same stream concurrently and independently.
Ability to consume records in the same order a few hours later. For example, you have a billing application and an audit application that runs a few hours behind the billing application. Because Amazon Kinesis Data Streams stores data for up to 7 days, you can run the audit application up to 7 days behind the billing application.
We recommend Amazon SQS for use cases with requirements that are similar to the following:
Messaging semantics (such as message-level ack/fail) and visibility timeout. For example, you have a queue of work items and want to track the successful completion of each item independently. Amazon SQS tracks the ack/fail, so the application does not have to maintain a persistent checkpoint/cursor. Amazon SQS will delete acked messages and redeliver failed messages after a configured visibility timeout.
Individual message delay. For example, you have a job queue and need to schedule individual jobs with a delay. With Amazon SQS, you can configure individual messages to have a delay of up to 15 minutes.
Dynamically increasing concurrency/throughput at read time. For example, you have a work queue and want to add more readers until the backlog is cleared. With Amazon Kinesis Data Streams, you can scale up to a sufficient number of shards (note, however, that you'll need to provision enough shards ahead of time).
Leveraging Amazon SQS's ability to scale transparently. For example, you buffer requests and the load changes as a result of occasional load spikes or the natural growth of your business. Because each buffered request can be processed independently, Amazon SQS can scale transparently to handle the load without any provisioning instructions from you.

Is it good or bad to create a Kinesis stream on demand for cost optimisation?

Here is the scenario:
I have X producers which generate millions of scheduled notifications (mail, SMS and push notifications) at different times of the day and night.
I'm using an AWS Kinesis stream to collect all notification entries, and a triggered Lambda function (mapped to the Kinesis stream) to process them.
Problem:
We need to keep the Kinesis stream provisioned (and billed) all the time, even when idle.
Possible solution:
Create a Kinesis stream on demand.
For example, the producer predicts the load size and, according to some predefined algorithm, creates a Kinesis stream with X shards.
Once the job is done, clean the stream up.
The question is: does the above approach look good and cost-effective?
Are there any technical challenges or blockers?
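A back-of-envelope comparison makes the trade-off concrete. The prices below are assumptions for illustration only (check the Kinesis pricing page for your region); the point is that an always-on provisioned stream is billed per shard-hour whether or not data flows:

```python
SHARD_HOUR_USD = 0.015  # assumed provisioned shard-hour price (illustrative)

def monthly_shard_cost(shards, hours_per_day, days=30):
    # Shard-hours billed accrue whether or not any data flows.
    return shards * hours_per_day * days * SHARD_HOUR_USD

always_on = monthly_shard_cost(10, 24)  # stream kept alive 24/7
windowed = monthly_shard_cost(10, 4)    # stream created only for a 4h daily window
print(always_on, windowed)
```

Against the savings you must weigh the operational cost of create/delete orchestration, the few minutes a stream takes to become ACTIVE, and re-wiring the Lambda event source mapping each time.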

How does AWS Lambda parallel execution work with DynamoDB?

I went through this article, which says that data records are organized into groups called shards, and these shards can be consumed and processed in parallel by a Lambda function.
I also found these slides from an AWS webinar; slide 22 likewise shows Lambda functions consuming different shards in parallel.
However, I could not achieve parallel execution of a single function. I created a simple Lambda function that runs for a minute, then started creating tons of items in DynamoDB, expecting to get a lot of stream records. In spite of this, my function invocations started one after another.
What am I doing wrong?
Pre-Context:
How does DynamoDB store data?
DynamoDB uses partitions to store table records. These partitions are abstracted from users and managed by the DynamoDB team. As data grows in the table, these partitions are further split internally.
What are these DynamoDB streams all about?
DynamoDB, as a database, provides a way for users to retrieve an ordered log of changes (think of it as the transactional replay log of a traditional database). These are vended as DynamoDB table streams.
How is data published to streams?
A stream has a concept of shards (somewhat similar to partitions). Shards by definition contain ordered events. In DynamoDB terminology, a stream shard contains the data from a certain partition.
Cool! So what happens if data grows in the table or frequent writes occur?
DynamoDB keeps persisting records, based on HashKey/SortKey, in the associated partition until a threshold is breached (such as table size and/or RCU/WCU counts). The exact values of these thresholds are not shared by DynamoDB, though we have some documentation with rough estimates.
When this threshold is breached, DynamoDB splits the partition and re-hashes to distribute the data (somewhat) evenly across the partitions.
Since new partitions have arrived, their data will be published to their own stream shards (each mapped to its partition).
Great, so what about Lambda? How does the parallel processing work then?
One Lambda function invocation processes records from one and only one shard. Thus the number of shards present in the DynamoDB stream determines the number of Lambda functions running in parallel.
Roughly, you can think of it as: # of partitions = # of shards = # of Lambda functions running in parallel.
From the first article it is said:
Because shards have a lineage (parent and children), applications must always process a parent shard before it processes a child shard. This will ensure that the stream records are also processed in the correct order.
Yet when working with Kinesis streams, for example, you can achieve parallelism by having multiple shards, as the order in which records are processed is guaranteed only within a shard.
As a side note, it makes some sense to trigger Lambda with DynamoDB events in order.
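The parent-before-child constraint from the quoted article amounts to a topological ordering over the shard lineage. A minimal sketch (the shard IDs and the lineage map are hypothetical):

```python
def processing_order(lineage):
    # lineage maps shard_id -> parent shard_id (None for root shards).
    # Emit a shard only once its parent has been emitted.
    order, done = [], set()
    while len(order) < len(lineage):
        progressed = False
        for shard, parent in lineage.items():
            if shard not in done and (parent is None or parent in done):
                order.append(shard)
                done.add(shard)
                progressed = True
        if not progressed:
            raise ValueError("lineage has a cycle or a missing parent")
    return order

lineage = {"shard-child-a": "shard-root", "shard-child-b": "shard-root", "shard-root": None}
print(processing_order(lineage))  # shard-root comes first
```

Processing the parent shard to completion before its children is what preserves the per-key ordering of the stream records across a shard split.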