I have worked a bit with Kafka in the past, and now there is a requirement to port part of our data pipeline to AWS Kinesis Data Streams. I have read that Kinesis is effectively a fork of Kafka and that the two share many similarities.
However, I have failed to see how we can have multiple consumers reading from the same stream, each with its own offset. Each data record is given a sequence number, but I couldn't find anything specific to a consumer (the equivalent of a Kafka group id?).
Is it really possible to have different consumers, each with a different ingestion rate, on the same Kinesis stream?
Yes.
You can have multiple Kinesis Consumer Applications. Let's say you have 2.
The first consumer application (the equivalent of a "consumer group" in Kafka, I think) can be "first-app" and store its positions in the DynamoDB table "first-app-table". It can have as many nodes (EC2 instances) as you want.
The second consumer application can work on the same stream and store its positions in another DynamoDB table, let's say "second-app-table".
Each table contains "what is the last processed position on shard X for app Y" information. So the two applications store checkpoints for the same shards in different places, which makes them independent.
As for the ingestion rate, consumer applications using the KCL have an "idleTimeBetweenReadsInMillis" value, which is the polling interval for the Kinesis GetRecords API. For example, the first application can have a poll interval of "2000", so it will poll the stream's shards every 2 seconds to see whether any new records have arrived.
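To make this concrete, a minimal sketch of a KCL 1.x worker is shown below; the application name, stream name, and record processor factory are placeholders, not your actual setup. The application name doubles as the name of the DynamoDB lease table where the checkpoints go:

```java
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessorFactory;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibConfiguration;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.Worker;

public class FirstApp {
    public static void main(String[] args) {
        // "first-app" also becomes the DynamoDB lease table that stores this app's checkpoints.
        KinesisClientLibConfiguration config = new KinesisClientLibConfiguration(
                "first-app",                              // application name (placeholder)
                "my-stream",                              // stream name (placeholder)
                new DefaultAWSCredentialsProviderChain(),
                "worker-1")                               // worker id
                .withIdleTimeBetweenReadsInMillis(2000);  // poll each shard every 2 seconds

        // Your record processor implementation goes here (not shown).
        IRecordProcessorFactory factory = () -> new FirstAppRecordProcessor();

        Worker worker = new Worker.Builder()
                .recordProcessorFactory(factory)
                .config(config)
                .build();
        worker.run();
    }
}
```

The second application would run the same code with "second-app" as the application name, so the KCL creates and maintains a separate "second-app" lease table for it.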
I don't know Kafka well, but as far as I remember, a Kafka "partition" is a "shard" in Kinesis, and likewise a Kafka "offset" is a "sequence number" in Kinesis. The Kinesis Client Library uses the term "checkpoint" for the stored sequence numbers. As you said, the concepts are similar.
A KCL consumer runs on EC2 machines in an Auto Scaling Group (ASG) sized according to the number of provisioned shards of the Kinesis data stream, which means that if the data stream has n provisioned shards, then at most n EC2 machines can be configured to consume messages, one per shard, as per this document link.
Messages are processed in real time as soon as they arrive in the data stream, since the shard iterator type is set to LATEST for the KCL consumer. For more info check here.
A DynamoDB table is configured for the KCL consumer, with a checkpoint entry for each provisioned shard, to keep track of which shards of the data stream are being leased and processed by the workers of the KCL consumer application.
Suppose we want to reprocess every message still present in the data stream within its retention period (which can be up to 7 days). Is there any simple and easy mechanism to do this?
Possible theoretical solutions (they may be incorrect or could be improved):
First Approach
Stop KCL consumer workers.
Delete the DynamoDB table that holds the checkpoint entries for the provisioned shards, so that the workers start picking up messages from the data stream again.
Restart the KCL consumer service.
Second Approach
Stop the KCL consumer.
Edit/update the checkpoint value for each shard to point at a previous/old position or timestamp. Is there any conversion formula? I don't know. Could we instead write some other placeholder value that the KCL consumer would then overwrite?
Restart the KCL consumer service.
Any other approach?
Please feel free to suggest or comment on how we can effectively reprocess Kinesis data stream messages without any problems.
To reprocess all the stream data with your first approach, you would need to change the initial shard iterator type from LATEST to TRIM_HORIZON before deleting the table and restarting the KCL consumer; otherwise you would only process new arrivals to the stream.
The second approach is also possible: you will need to get a shard iterator for each shard, again using the TRIM_HORIZON shard iterator type. You can also indicate a timestamp (AT_TIMESTAMP) in case you need to reprocess less data than the full retention of your stream. This AWS reference documentation can be useful.
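For illustration, getting a shard iterator with TRIM_HORIZON (or AT_TIMESTAMP) looks roughly like this with the AWS SDK for Java v2; the stream and shard names are placeholders:

```java
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.GetRecordsRequest;
import software.amazon.awssdk.services.kinesis.model.GetRecordsResponse;
import software.amazon.awssdk.services.kinesis.model.GetShardIteratorRequest;
import software.amazon.awssdk.services.kinesis.model.ShardIteratorType;

public class ReprocessShard {
    public static void main(String[] args) {
        KinesisClient kinesis = KinesisClient.create();

        // Start from the oldest record still retained on the shard (TRIM_HORIZON),
        // or from a point in time (AT_TIMESTAMP) to reprocess less than the full retention.
        String iterator = kinesis.getShardIterator(GetShardIteratorRequest.builder()
                .streamName("my-stream")                             // placeholder stream name
                .shardId("shardId-000000000000")                     // placeholder shard id
                .shardIteratorType(ShardIteratorType.TRIM_HORIZON)
                // .shardIteratorType(ShardIteratorType.AT_TIMESTAMP)
                // .timestamp(java.time.Instant.parse("2020-01-01T00:00:00Z"))
                .build()).shardIterator();

        GetRecordsResponse records = kinesis.getRecords(GetRecordsRequest.builder()
                .shardIterator(iterator)
                .limit(100)
                .build());
        records.records().forEach(r -> System.out.println(r.sequenceNumber()));
    }
}
```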
Can we have multiple destinations from a single Kinesis stream?
I am getting output in Splunk but now I also want to add an S3 bucket as the destination.
If I add another Amazon Kinesis Data Firehose, will it affect the performance of the Splunk reads? Splunk pulls directly from Kinesis. If I add another destination, will it affect our current reads and writes?
One of the benefits of using Kinesis is that it supports exactly this behaviour.
Each consumer application is responsible for tracking which events it has read from each shard. There is no shared notion between two separate applications of a record having already been processed.
One recommendation from AWS to bear in mind when you need high throughput for multiple consumers is to use enhanced fan-out:
Each consumer registered to use enhanced fan-out receives its own read throughput per shard, up to 2 MB/sec, independently of other consumers.
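For reference, registering a second enhanced fan-out consumer is a one-off call; a sketch with the AWS SDK for Java v2, using a placeholder stream ARN and consumer name:

```java
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.RegisterStreamConsumerRequest;
import software.amazon.awssdk.services.kinesis.model.RegisterStreamConsumerResponse;

public class RegisterConsumers {
    public static void main(String[] args) {
        KinesisClient kinesis = KinesisClient.create();

        // Each registered consumer gets its own 2 MB/s read throughput per shard,
        // so an "s3-archiver" consumer does not compete with the existing Splunk reader.
        RegisterStreamConsumerResponse response = kinesis.registerStreamConsumer(
                RegisterStreamConsumerRequest.builder()
                        .streamARN("arn:aws:kinesis:us-east-1:123456789012:stream/my-stream") // placeholder
                        .consumerName("s3-archiver")                                          // placeholder
                        .build());
        System.out.println(response.consumer().consumerARN());
    }
}
```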
We have a Spring Boot app deployed on Kubernetes that processes messages: it reads from a Kafka topic, does some mappings, and finally writes to other Kafka topics.
To achieve higher performance, we need to process the messages faster, so we introduce multiple nodes of this Spring Boot app.
But I believe this would lead to a problem because:
The messages should be processed in order
the messages carry state
Is there any solution that keeps the messages in order, guarantees that a message already processed by one node won't be processed by another, and resolves any other issues caused by processing on multiple nodes?
Please feel free to address all possible solutions because we are building a POC.
Would using Apache Flink or spring-cloud-stream help with this?
When consuming messages from Kafka it is important to keep the concept of a Consumer Group in mind. This concept ensures that nodes that read from a Kafka topic and share the same Consumer Group will not interfere with each other. Whatever has been read by one consumer within the Consumer Group will not be read again by another consumer of the same Consumer Group.
In addition, applications reading and writing to Kafka scale with the number of partitions in a Kafka topic.
Adding more nodes would not have any impact if they consume a topic with only one partition, as a partition can only be read by a single consumer within a Consumer Group. You will find more information in the Kafka documentation on Consumers.
When you have a topic with more than one partition, ordering might become an issue. Kafka only guarantees the order within a partition.
Here is an excerpt of the Kafka documentation describing the interaction between consumer group and partitions:
By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.
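To make this concrete, here is a minimal consumer sketch; the topic and group names are placeholders. Every instance started with the same group.id joins the same Consumer Group and is assigned its own subset of partitions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processor");   // shared by all instances
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));    // topic name is a placeholder
            while (true) {
                // Each partition is assigned to exactly one instance in the group,
                // and records within a partition arrive in order.
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```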
The limit to scaling up with Flink will be the number of partitions in your Kafka topic -- in other words, each instance of Flink's Kafka consumer will connect to and read from one or more partitions. With Flink, the ordering will be preserved unless you re-partition the data. Flink does provide exactly-once guarantees.
A quick way to experience Flink and Kafka in action together is to explore Flink's operations playground. This dockerized playground is set up to let you explore rescaling, failure recovery, etc., and should make all this much more concrete.
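If you try the POC with Flink, reading the topic looks roughly like the sketch below using the KafkaSource connector; the topic, group id, bootstrap servers, and mapping step are placeholder assumptions, and the exact API depends on your Flink version:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FlinkKafkaJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("orders")                          // placeholder topic
                .setGroupId("flink-poc")                      // placeholder group id
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Parallelism beyond the number of Kafka partitions leaves some subtasks idle.
        DataStream<String> stream =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-orders");
        stream.map(String::toUpperCase).print();              // stand-in for your mapping logic

        env.execute("kafka-ordering-poc");
    }
}
```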
You can run several consumer threads in a single application, or even run several applications, each with several consumer threads. When all consumers belong to the same group and the Kafka topic has enough partitions, Kafka will balance the partitions among the consumers.
Messages within one partition are always ordered, but to keep messages with the same key in order even across producer retries, you should set max.in.flight.requests.per.connection=1 on the producer. The broker always writes messages with the same key to the same partition (unless you change the number of partitions), so all messages with the same key will be ordered.
A partition is read by only one consumer, so the only way another consumer can receive already-processed messages is if a partition rebalance happens before the message offset has been acknowledged. You can set ack-mode=MANUAL_IMMEDIATE and acknowledge each message immediately after processing it, or use one of the other acknowledgement modes.
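Roughly, the consumer side of that looks like the sketch below with Spring for Apache Kafka; the topic, group id, and listener factory wiring are simplified assumptions:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.listener.ContainerProperties;
import org.springframework.kafka.support.Acknowledgment;
import org.springframework.stereotype.Component;

@Component
public class OrderListener {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        // Commit the offset as soon as acknowledge() is called for each record.
        factory.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL_IMMEDIATE);
        return factory;
    }

    @KafkaListener(topics = "orders", groupId = "order-processor")   // placeholders
    public void listen(ConsumerRecord<String, String> record, Acknowledgment ack) {
        process(record.value());   // your business logic
        ack.acknowledge();         // commit the offset right after successful processing
    }

    private void process(String value) { /* ... */ }
}
```

On the producer side, the ordering setting mentioned above corresponds to ProducerConfig.MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION set to 1.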
I'd recommend reading this article: https://medium.com/@felipedutratine/kafka-ordering-guarantees-99320db8f87f
I have a server which can only process 20 requests at a time. When lots of requests come in, I want to store the request data in a queue, then read a set of requests (i.e. 20) and process them as a batch. What would be the ideal way to do that: SQS or Kinesis? I'm totally confused.
SQS (Simple Queue Service) is for queuing messages in a 1:1 fashion (once a message is consumed, it is removed from the queue).
Kinesis is for low-latency, high-volume data streaming, typically 1:N (many consumers of the same messages).
As Kinesis also stores the data for a period of time, the two are often confused, but their architectural patterns are totally different.
Queue => SQS.
Data Streams => Kinesis.
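A sketch of the SQS side, assuming the AWS SDK for Java v2 and a placeholder queue URL; note that a single ReceiveMessage call returns at most 10 messages, so a batch of 20 needs two polls:

```java
import java.util.ArrayList;
import java.util.List;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

public class BatchWorker {
    public static void main(String[] args) {
        SqsClient sqs = SqsClient.create();
        String queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/request-queue"; // placeholder

        List<Message> batch = new ArrayList<>();
        while (batch.size() < 20) {
            List<Message> messages = sqs.receiveMessage(ReceiveMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .maxNumberOfMessages(10)     // SQS maximum per call
                    .waitTimeSeconds(5)          // long polling
                    .build()).messages();
            if (messages.isEmpty()) break;       // queue drained
            batch.addAll(messages);
        }

        for (Message m : batch) {
            handleRequest(m.body());             // your processing, up to 20 at a time
            sqs.deleteMessage(DeleteMessageRequest.builder()   // ack: remove it from the queue
                    .queueUrl(queueUrl)
                    .receiptHandle(m.receiptHandle())
                    .build());
        }
    }

    private static void handleRequest(String body) { /* ... */ }
}
```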
Taken from https://aws.amazon.com/kinesis/data-streams/faqs/ :
Q: How does Amazon Kinesis Data Streams differ from Amazon SQS?
Amazon Kinesis Data Streams enables real-time processing of streaming big data. It provides ordering of records, as well as the ability to read and/or replay records in the same order to multiple Amazon Kinesis Applications. The Amazon Kinesis Client Library (KCL) delivers all records for a given partition key to the same record processor, making it easier to build multiple applications reading from the same Amazon Kinesis data stream (for example, to perform counting, aggregation, and filtering).

Amazon Simple Queue Service (Amazon SQS) offers a reliable, highly scalable hosted queue for storing messages as they travel between computers. Amazon SQS lets you easily move data between distributed application components and helps you build applications in which messages are processed independently (with message-level ack/fail semantics), such as automated workflows.

Q: When should I use Amazon Kinesis Data Streams, and when should I use Amazon SQS?

We recommend Amazon Kinesis Data Streams for use cases with requirements that are similar to the following:

Routing related records to the same record processor (as in streaming MapReduce). For example, counting and aggregation are simpler when all records for a given key are routed to the same record processor.

Ordering of records. For example, you want to transfer log data from the application host to the processing/archival host while maintaining the order of log statements.

Ability for multiple applications to consume the same stream concurrently. For example, you have one application that updates a real-time dashboard and another that archives data to Amazon Redshift. You want both applications to consume data from the same stream concurrently and independently.

Ability to consume records in the same order a few hours later. For example, you have a billing application and an audit application that runs a few hours behind the billing application. Because Amazon Kinesis Data Streams stores data for up to 7 days, you can run the audit application up to 7 days behind the billing application.

We recommend Amazon SQS for use cases with requirements that are similar to the following:

Messaging semantics (such as message-level ack/fail) and visibility timeout. For example, you have a queue of work items and want to track the successful completion of each item independently. Amazon SQS tracks the ack/fail, so the application does not have to maintain a persistent checkpoint/cursor. Amazon SQS will delete acked messages and redeliver failed messages after a configured visibility timeout.

Individual message delay. For example, you have a job queue and need to schedule individual jobs with a delay. With Amazon SQS, you can configure individual messages to have a delay of up to 15 minutes.

Dynamically increasing concurrency/throughput at read time. For example, you have a work queue and want to add more readers until the backlog is cleared. With Amazon Kinesis Data Streams, you can scale up to a sufficient number of shards (note, however, that you'll need to provision enough shards ahead of time).

Leveraging Amazon SQS’s ability to scale transparently. For example, you buffer requests and the load changes as a result of occasional load spikes or the natural growth of your business. Because each buffered request can be processed independently, Amazon SQS can scale transparently to handle the load without any provisioning instructions from you.
We want to use a Kinesis stream and Firehose to update an AWS-managed Elasticsearch cluster. We have hundreds of different indexes (corresponding to our DB shards) that need to be updated. When creating the Firehose, it requires that I specify the specific index name I want updated. Does that mean I need to create a separate Firehose for each index in my cluster? Or is there a way to configure the Firehose so it knows which index to use based on the content of the data?
Also, we would have 20 or so separate producers that would send data to a Kinesis stream (each of these producers would generate data for 10 different indexes). Would I also need a separate Kinesis stream for each producer?
Summary:
20 producers (EC2 instances) -> Each producer sends data for 20 different indexes to a kinesis stream -> The kinesis stream then uses a firehose to update a single cluster which has 200 indexes in it.
Note: all of the indexes have the same mapping and name template, i.e. index_1, index_2...index_200.
Edit: As we reindex the data we create new indexes along the lines of index_1-v2. Obviously we won't want to create a new firehose for each index version as they're being created. The new index name can be included in the JSON that's sent to the kinesis stream.
As you guessed, Firehose is the wrong solution for this problem, at least as stated. It is designed for situations where there's a 1:1 correspondence between stream (not producer!) and index. Things like clickstream data or log aggregation.
For any solution, you'll need to provide a mechanism to identify which index a record belongs to. You could do this by creating a separate Kinesis stream per message type (in which case you could use Firehose), but this would mean that your producers have to decide which stream to write each message to. That may cause unwanted complexity in your producers, and may also increase your costs unacceptably.
So, assuming that you want a single stream for all messages, you need a consumer application and some way to group those messages. You could include a message type (/ index name) in the record itself, or use the partition key for that purpose. The partition key makes for a somewhat easier implementation, as it guarantees that records for the same index will be stored on the same shard, but it means that your producers may be throttled.
For the consumer, you could use an always-on application that runs on EC2, or have the stream invoke a Lambda function.
Using Lambda is nice if you're using the partition key to identify the message type, because each invocation only looks at a single shard (you may still have multiple partition keys in the invocation). On the downside, Lambda will poll the stream once per second, which may result in throttling if you have multiple stream consumers (with a stand-alone app you can control how often it polls the stream).
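As an illustration of the "index name in the record" variant, a Lambda consumer could look roughly like the sketch below; the "index" JSON field and the indexDocument helper are assumptions for this example, not an existing API:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.KinesisEvent;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class IndexRouterHandler implements RequestHandler<KinesisEvent, Void> {

    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public Void handleRequest(KinesisEvent event, Context context) {
        for (KinesisEvent.KinesisEventRecord record : event.getRecords()) {
            // Copy the record payload out of the event's ByteBuffer.
            ByteBuffer data = record.getKinesis().getData();
            byte[] payload = new byte[data.remaining()];
            data.get(payload);
            try {
                JsonNode doc = mapper.readTree(payload);
                JsonNode indexField = doc.get("index");          // producer-supplied index name
                if (indexField == null) {
                    context.getLogger().log("Record missing 'index' field, skipping");
                    continue;
                }
                indexDocument(indexField.asText(), doc);         // e.g. "index_1-v2"
            } catch (IOException e) {
                context.getLogger().log("Skipping malformed record: " + e.getMessage());
            }
        }
        return null;
    }

    private void indexDocument(String index, JsonNode doc) {
        // Send the document to the Elasticsearch cluster here (index/bulk request against `index`).
    }
}
```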