multiple nodes for message processing Kafka

We have a Spring Boot app deployed on Kubernetes that processes messages: it reads from a Kafka topic, applies some mappings, and finally writes to Kafka topics.
To achieve higher throughput we need to process the messages faster, so we are introducing multiple nodes of this Spring Boot app.
But I believe this would lead to a problem, because:
The messages should be processed in order.
Each message contains a state.
Is there a way to keep the messages in order, to guarantee that a message already processed by one node is not processed by another, and to resolve any other issues caused by processing on multiple nodes?
Please feel free to address all possible solutions, because we are building a POC.
Would Apache Flink or spring-cloud-stream help in this matter?

When consuming messages from Kafka it is important to keep the concept of a Consumer Group in mind. This concept ensures that nodes that read from a Kafka topic and share the same Consumer Group will not interfere with each other. Whatever has been read by one consumer in the Consumer Group will not be read again by another consumer of the same Consumer Group.
In addition, applications reading and writing to Kafka scale with the number of partitions in a Kafka topic.
Adding nodes has no effect if the topic has only one partition, because a partition can be read by only one consumer within a Consumer Group. You will find more information in the Kafka documentation on Consumers.
When you have a topic with more than one partition, ordering might become an issue. Kafka only guarantees the order within a partition.
Here is an excerpt of the Kafka documentation describing the interaction between consumer group and partitions:
By having a notion of parallelism—the partition—within the topics, Kafka is able to provide both ordering guarantees and load balancing over a pool of consumer processes. This is achieved by assigning the partitions in the topic to the consumers in the consumer group so that each partition is consumed by exactly one consumer in the group. By doing this we ensure that the consumer is the only reader of that partition and consumes the data in order. Since there are many partitions this still balances the load over many consumer instances. Note however that there cannot be more consumer instances in a consumer group than partitions.
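To make the partition-to-consumer rule concrete, here is a minimal sketch of a round-robin assignment. It is only an illustration: the function name and node names are made up, and Kafka's real assignors handle rebalancing and stickiness that this ignores.

```python
from collections import defaultdict

def assign_partitions(partitions, consumers):
    """Round-robin: every partition is given to exactly one consumer."""
    assignment = defaultdict(list)
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    # Consumers that received nothing still appear, with an empty list.
    return {c: assignment[c] for c in consumers}

three_partitions = [0, 1, 2]
print(assign_partitions(three_partitions, ["node-a", "node-b"]))
# {'node-a': [0, 2], 'node-b': [1]}
print(assign_partitions(three_partitions, ["node-a", "node-b", "node-c", "node-d"]))
# {'node-a': [0], 'node-b': [1], 'node-c': [2], 'node-d': []}
```

With four nodes and three partitions, node-d stays idle, which is why the number of partitions caps the useful number of consumers in a group.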

The limit to scaling up with Flink will be the number of partitions in your Kafka topic -- in other words, each instance of Flink's Kafka consumer will connect to and read from one or more partitions. With Flink, the ordering will be preserved unless you re-partition the data. Flink does provide exactly-once guarantees.
A quick way to experience Flink and Kafka in action together is to explore Flink's operations playground. This dockerized playground is set up to let you explore rescaling, failure recovery, etc., and should make all this much more concrete.

You can run several consumer threads in a single application, or even run several applications with several consumer threads each. When all consumers belong to the same group and the Kafka topic has enough partitions, Kafka will balance the topic partitions among them.
Messages within one partition are always ordered, but to keep the order per message key when the producer retries a failed send you should set max.in.flight.requests.per.connection=1 on the producer. The producer's default partitioner always sends messages with the same key to the same partition (unless you change the number of partitions), so all messages with the same key stay ordered.
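As a rough illustration of why keyed messages stay ordered, the sketch below maps keys to partitions with a hash. It uses CRC32 purely as a stand-in (Kafka's default partitioner uses a different hash), and the order keys and states are invented for the example.

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration only; Kafka's default partitioner uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

events = [("order-1", "created"), ("order-2", "created"),
          ("order-1", "paid"), ("order-1", "shipped")]

partitions = {}
for key, state in events:
    # Each partition is an append-only list, so arrival order is kept.
    partitions.setdefault(partition_for(key, 4), []).append((key, state))

# All events for order-1 sit in one partition, in the order they were sent.
order_1 = [s for k, s in partitions[partition_for("order-1", 4)] if k == "order-1"]
print(order_1)  # ['created', 'paid', 'shipped']
```

Because the partition is derived from the key modulo the partition count, changing the number of partitions can move a key to a different partition, which is the caveat mentioned above.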
A partition is read by only one consumer in the group, so the only way another consumer can receive already-processed messages is a partition rebalance that happens before the message has been acknowledged. You can set ack-mode=MANUAL_IMMEDIATE and acknowledge a message immediately after processing it, or use one of the other acknowledgment modes.
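If a rebalance does cause a redelivery, one common safeguard (not something specific to Spring or Kafka, and not part of the settings above) is to make processing idempotent by remembering which (partition, offset) pairs were already handled. A minimal in-memory sketch, with all names hypothetical:

```python
def process_batch(records, processed_offsets, handler):
    """Handle each record once; skip (partition, offset) pairs seen before."""
    for partition, offset, value in records:
        if (partition, offset) in processed_offsets:
            continue
        handler(value)
        processed_offsets.add((partition, offset))

# Assumption: a real deployment would keep this set in a store shared by all nodes.
seen = set()
output = []
first_delivery = [(0, 10, "a"), (0, 11, "b")]
redelivery = [(0, 11, "b"), (0, 12, "c")]  # offset 11 arrives again after a rebalance

process_batch(first_delivery, seen, output.append)
process_batch(redelivery, seen, output.append)
print(output)  # ['a', 'b', 'c']
```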
I'd recommend reading this article: https://medium.com/#felipedutratine/kafka-ordering-guarantees-99320db8f87f

Related

Scaling my Kinesis consumers when consumption is slow

Assume I have an AWS Kinesis stream with 2 shards.
Therefore I have two consumers, one consuming from each shard.
There are a large number of entries in the queues and my consumers are consuming them slowly.
To solve this I could take a Kafka consumer group-like approach: create two consumer applications and consume records.
But I want to know whether I can reshard my queue (which will distribute the records across shards) and add consumers for those shards.
That is, after resharding my stream will have 4 shards and hence 4 consumers.
This will also increase the consumption rate and solve my problem.
What are the pros and cons of the second approach, given that it is generally suggested when the queue has an ingestion issue?

Adding shards to increase consuming capacity is exactly what Kinesis is about. It will increase the parallelism of your "consumer application" and should be pretty seamless.
Note that AWS recently introduced a serverless Kinesis model (on-demand capacity mode) where AWS manages the number of shards for you, so you don't have to worry about the shard count anymore.

How to limit the number of unprocessed records for AWS Kinesis?

For example, RabbitMQ has a way of setting queue limits. If that limit is reached, new messages from publishers are rejected, which applies a kind of backpressure from the consumers to the producers (since messages sitting in queues have not yet been processed by consumers).
Is there a way to ensure this kind of behavior for brokers like Kinesis, where the consumers pull messages rather than the broker pushing to them as in RabbitMQ?
In the case of Kinesis, similar to Kafka, the state of the consumers (consumption offset and so on) is kept in a different entity, DynamoDB for Kinesis, and I know this can make it trickier to have something like an unprocessed-records limit out of the box.
Does anyone know if there are settings for this, maybe through the KCL / KPL client libraries or something similar?

No. Unfortunately, AWS Kinesis does not provide the feature you want. There is no way to stop a producer from writing into a Kinesis stream when the consumer cannot keep up with processing.
In fact this is one of the advantages of using Kinesis: it allows unlimited buffering of data up to the configured retention time for free. The only time it provides back pressure is when the producer writes too much data too fast and hits the Amazon Kinesis API limits: https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html
If you want a size-limited "queue", you might look into AWS SQS, which caps the number of in-flight messages (about 120,000 for a standard queue).
If you do want to use Kinesis, you might want to build a custom solution to feed the consumer delay back to the producer. For example, implement custom logic in the producer to monitor the consumer delay ('MillisBehindLatest') using Amazon CloudWatch (see https://docs.aws.amazon.com/streams/latest/dev/monitoring-with-kcl.html) and stop when the consumer is falling behind.
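The custom feedback loop could look roughly like the sketch below. Everything here is an assumption for illustration: `get_millis_behind` is a hypothetical callable that would, in practice, read the 'MillisBehindLatest' metric from CloudWatch, and `send` stands in for whatever call writes a record to the stream.

```python
import time

def produce_with_backpressure(records, send, get_millis_behind,
                              max_lag_ms=60_000, pause_s=1.0, sleep=time.sleep):
    """Send records, pausing while the reported consumer lag exceeds max_lag_ms."""
    for record in records:
        while get_millis_behind() > max_lag_ms:
            sleep(pause_s)  # wait for the consumer to catch up
        send(record)

# Simulated run: the lag readings are fake and sleep() is replaced by a recorder.
sent, pauses = [], []
fake_lag = iter([120_000, 90_000, 30_000, 10_000])
produce_with_backpressure(["r1", "r2"], sent.append,
                          lambda: next(fake_lag), sleep=pauses.append)
print(sent, pauses)  # ['r1', 'r2'] [1.0, 1.0]
```

The producer waited twice before the first record because the first two lag readings were above the threshold; the threshold and pause length are arbitrary values you would tune.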

SQS fanout with DLQ

I am looking at using SNS-SQS to deliver updates to machines running the same service. Since the plan is not for machines to communicate with each other, I was planning on creating an SQS queue for each machine (the queue would be created at startup).
I am, however, not sure how to use a Dead-Letter Queue (DLQ) in such a case. Should each SQS queue have its own DLQ, or can I have a common one shared across my SQS queues in the region? The concern I have with the former approach is that too many queues would be created (2x the number of machines), and the concern with the latter is potentially having multiple copies of the same message in the DLQ.
What is the best practice and recommended approach when using multiple SQS queues?

I wouldn't be concerned with the number of queues - they don't cost anything - so it really depends on how you plan on using the items in the dead-letter queue. I'll make the assumption that you will have some sort of process to review items in the DLQ to figure out why they were not processed before expiring.
Without knowing the details of what you plan to do, I would think a single DLQ would be better, and if you need to periodically process DLQ records, the processing app/system only needs to monitor that single queue.
Can't see the advantage of multiple DLQs in this case, at least based on your question.

As you are planning a fanout process, having multiple queues does no harm as long as they are used for asynchronous processing. Otherwise a single queue is preferred. A fanout process is generally used when you want to process a few tasks concurrently by dividing them among several queues and working on them separately.
The purpose of a Dead Letter Queue (DLQ) is to store messages that a certain process could not complete successfully. Unless your process has a major fault, the number of messages stored in a DLQ should be very small. Therefore it is okay to go ahead and use one DLQ for all the other SQS queues.
Having multiple DLQs brings overhead, since several processes will have to poll the DLQs for failed messages. Having just one DLQ reduces this overhead.
Multiple DLQs are recommended if you want to store different categories of failed messages separately.

Kafka like offset on Kinesis Stream?

I have worked a bit with Kafka in the past, and lately there is a requirement to port part of the data pipeline to AWS Kinesis Streams. I have read that Kinesis is effectively a fork of Kafka and that they share many similarities.
However, I have failed to see how we can have multiple consumers reading from the same stream, each with its own offset. There is a sequence number given to each data record, but I couldn't find anything specific to a consumer (like the Kafka group ID?).
Is it really possible to have different consumers with different ingestion rates over the same AWS Kinesis stream?

Yes.
You can have multiple Kinesis Consumer Applications. Let's say you have 2.
The first consumer application (I think it is a "consumer group" in Kafka?) can be "first-app" and store its positions in the DynamoDB table "first-app-table". It can have as many nodes (EC2 instances) as you want.
The second consumer application can also work on the same stream and store its positions in another DynamoDB table, let's say "second-app-table".
Each table will contain the information "what is the last processed position on shard X for app Y". So the 2 applications store checkpoints for the same shards in different places, which makes them independent.
About the ingestion rate: there is an "idleTimeBetweenReadsInMillis" value in consumer applications using KCL, which is the polling interval for Get operations against the Amazon Kinesis API. For example, the first application can have a poll interval of "2000", so it will poll the stream's shards every 2 seconds to see if any new records have arrived.
I don't know Kafka well, but as far as I remember a Kafka "partition" is a "shard" in Kinesis, and likewise a Kafka "offset" is a "sequence number" in Kinesis. The Kinesis Client Library uses the term "checkpoint" for the stored sequence numbers. Like you said, the concepts are similar.
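Here is a toy model of the two independent checkpoint tables, just to show why the applications do not affect each other. It is a simplification under stated assumptions: plain dicts stand in for the DynamoDB tables, and list indices stand in for Kinesis sequence numbers.

```python
# One shard with four records; list positions stand in for sequence numbers.
stream = {"shard-0": ["e1", "e2", "e3", "e4"]}

# One checkpoint table per consumer application (KCL keeps these in DynamoDB).
checkpoints = {
    "first-app-table": {"shard-0": 0},
    "second-app-table": {"shard-0": 0},
}

def poll(table, shard, max_records):
    """Read up to max_records after this app's checkpoint, then advance it."""
    start = checkpoints[table][shard]
    batch = stream[shard][start:start + max_records]
    checkpoints[table][shard] = start + len(batch)
    return batch

fast = poll("first-app-table", "shard-0", 3)   # ['e1', 'e2', 'e3']
slow = poll("second-app-table", "shard-0", 1)  # ['e1']
print(fast, slow, checkpoints)
```

After these two calls the first application is at position 3 and the second at position 1 on the same shard, so each one consumes at its own rate.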

Why should I use Amazon Kinesis and not SNS-SQS?

I have a use case where a stream of data will be coming in, and I cannot consume it at the same pace, so I need a buffer. This can be solved using an SNS-SQS queue. I learned that Kinesis serves the same purpose, so what is the difference? Why should I prefer (or not prefer) Kinesis?

Keep in mind this answer was correct for Jun 2015
After studying the issue for a while, having the same question in mind, I found that SQS (with SNS) is preferred for most use cases unless the order of the messages is important to you (SQS doesn't guarantee FIFO on messages).
There are 2 main advantages for Kinesis:
you can read the same message from several applications
you can re-read messages in case you need to.
Both advantages can be achieved by using SNS as a fan-out to SQS. That means the producer sends only one message to SNS, and SNS then fans the message out to multiple SQS queues, one for each consumer application. This way you can have as many consumers as you want without thinking about sharding capacity.
Moreover, we added one more SQS queue, subscribed to the SNS topic, that holds messages for 14 days. Normally no one reads from this queue, but if a bug makes us want to rewind the data, we can easily read all the messages from it and re-send them to SNS. Kinesis, by contrast, only provides 7 days of retention.
In conclusion, SNS+SQSs is much easier and provides most capabilities. IMO you need a really strong case to choose Kinesis over it.

On the surface they are vaguely similar, but your use case will determine which tool is appropriate. IMO, if you can get by with SQS then you should - if it will do what you want, it will be simpler and cheaper, but here is a better explanation from the AWS FAQ which gives examples of appropriate use-cases for both tools to help you decide:
FAQ's

Semantics of these technologies are different because they were designed to support different scenarios:
SNS/SQS: the items in the stream are not related to each other
Kinesis: the items in the stream are related to each other
Let's understand the difference by example.
Suppose we have a stream of orders, for each order we need to reserve some stock and schedule a delivery. Once this is complete, we can safely remove the item from the stream and start processing the next order. We are fully done with the previous order before we start the next one.
Again, we have the same stream of orders, but now our goal is to group orders by destinations. Once we have, say, 10 orders to the same place, we want to deliver them together (delivery optimization). Now the story is different: when we get a new item from the stream, we cannot finish processing it; rather we "wait" for more items to come in order to meet our goal. Moreover, if the processor process crashes, we must "restore" the state (so no order will be lost).
Once processing of one item cannot be separated from processing another one, we must have Kinesis semantics in order to handle all the cases safely.
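The second scenario can be sketched as follows. This is only a toy version of the delivery-grouping example above: the function name is invented, and the batch size is lowered from 10 to 2 in the demo to keep it short.

```python
from collections import defaultdict

def group_orders(orders, batch_size=10):
    """Collect orders per destination; emit a delivery once a batch is full."""
    pending = defaultdict(list)  # state that would have to be restored after a crash
    deliveries = []
    for order_id, destination in orders:
        pending[destination].append(order_id)
        if len(pending[destination]) == batch_size:
            deliveries.append((destination, pending.pop(destination)))
    return deliveries, dict(pending)

orders = [(1, "Paris"), (2, "Oslo"), (3, "Paris"), (4, "Paris")]
deliveries, waiting = group_orders(orders, batch_size=2)
print(deliveries)  # [('Paris', [1, 3])]
print(waiting)     # {'Oslo': [2], 'Paris': [4]}
```

Orders 2 and 4 cannot be finished when they arrive; they sit in `waiting` until more orders for the same destination show up, which is exactly the cross-item dependency described above.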

Kinesis supports multiple consumers, which means the same data records can be processed by different consumers at the same time or at different times within 24 hours. Similar behavior can be achieved in SQS by writing into multiple queues and having each consumer read from its own queue. However, writing the same data into multiple queues adds sub-second (a few milliseconds) latency to the system.
Second, Kinesis provides a routing capability: the partition key selectively routes data records to different shards, which can then be processed by particular EC2 instances, enabling micro-batch calculations (counting and aggregation).
Working with any AWS service is easy, but SQS is the easiest. With Kinesis you need to provision enough shards ahead of time, dynamically increase the number of shards to handle load spikes, and decrease it again to save cost. That management is a pain in Kinesis; no such things are required with SQS. SQS is infinitely scalable.

Excerpt from AWS Documentation:
We recommend Amazon Kinesis Streams for use cases with requirements that are similar to the following:
Routing related records to the same record processor (as in streaming MapReduce). For example, counting and aggregation are simpler when all records for a given key are routed to the same record processor.
Ordering of records. For example, you want to transfer log data from the application host to the processing/archival host while maintaining the order of log statements.
Ability for multiple applications to consume the same stream concurrently. For example, you have one application that updates a real-time dashboard and another that archives data to Amazon Redshift. You want both applications to consume data from the same stream concurrently and independently.
Ability to consume records in the same order a few hours later. For example, you have a billing application and an audit application that runs a few hours behind the billing application. Because Amazon Kinesis Streams stores data for up to 7 days, you can run the audit application up to 7 days behind the billing application.
We recommend Amazon SQS for use cases with requirements that are similar to the following:
Messaging semantics (such as message-level ack/fail) and visibility timeout. For example, you have a queue of work items and want to track the successful completion of each item independently. Amazon SQS tracks the ack/fail, so the application does not have to maintain a persistent checkpoint/cursor. Amazon SQS will delete acked messages and redeliver failed messages after a configured visibility timeout.
Individual message delay. For example, you have a job queue and need to schedule individual jobs with a delay. With Amazon SQS, you can configure individual messages to have a delay of up to 15 minutes.
Dynamically increasing concurrency/throughput at read time. For example, you have a work queue and want to add more readers until the backlog is cleared. With Amazon Kinesis Streams, you can scale up to a sufficient number of shards (note, however, that you'll need to provision enough shards ahead of time).
Leveraging Amazon SQS’s ability to scale transparently. For example, you buffer requests and the load changes as a result of occasional load spikes or the natural growth of your business. Because each buffered request can be processed independently, Amazon SQS can scale transparently to handle the load without any provisioning instructions from you.

The biggest advantage for me is the fact that Kinesis is a replayable queue, and SQS is not. So you can have multiple consumers of the same messages of Kinesis (or the same consumer at different times) where with SQS, once a message has been ack'd, it's gone from that queue.
SQS is better for worker queues because of that.

Another thing: Kinesis can trigger a Lambda, while SQS cannot. So with SQS you either have to provide an EC2 instance to process SQS messages (and deal with it if it fails), or you have to have a scheduled Lambda (which doesn't scale up or down - you get just one per minute).
Edit: This answer is no longer correct. SQS can directly trigger Lambda as of June 2018
https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html

The pricing models are different, so depending on your use case one or the other may be cheaper. Using the simplest case (not including SNS):
SQS charges per message (each 64 KB counts as one request).
Kinesis charges per shard per hour (1 shard can handle up to 1000 records per second or 1 MB/second) and also for the amount of data you put in (in units of 25 KB).
Plugging in the current prices and not taking into account the free tier, if you send 1 GB of messages per day at the maximum message size, Kinesis will cost much more than SQS ($10.82/month for Kinesis vs. $0.20/month for SQS). But if you send 1 TB per day, Kinesis is somewhat cheaper ($158/month vs. $201/month for SQS).
Details: SQS charges $0.40 per million requests (64 KB each), so $0.00655 per GB. At 1 GB per day, this is just under $0.20 per month; at 1 TB per day, it comes to a little over $201 per month.
Kinesis charges $0.014 per million requests (25 KB each), so $0.00059 per GB. At 1 GB per day, this is less than $0.02 per month; at 1 TB per day, it is about $18 per month. However, Kinesis also charges $0.015 per shard-hour. You need at least 1 shard per 1 MB per second. At 1 GB per day, 1 shard will be plenty, so that will add another $0.36 per day, for a total cost of $10.82 per month. At 1 TB per day, you will need at least 13 shards, which adds another $4.68 per day, for a total cost of $158 per month.
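The arithmetic above can be reproduced with a short script. The prices are the ones quoted in this answer and may well be out of date; the 30-day month and 1 GB = 1024 x 1024 KB are my assumptions about how the figures were derived.

```python
import math

KB_PER_GB = 1024 * 1024
DAYS = 30  # assumed month length

def sqs_monthly(gb_per_day, price_per_million=0.40, request_kb=64):
    """SQS cost: one request per 64 KB, billed per million requests."""
    requests_per_gb = KB_PER_GB / request_kb
    return gb_per_day * DAYS * requests_per_gb * price_per_million / 1e6

def kinesis_monthly(gb_per_day, price_per_million=0.014, unit_kb=25, shard_hour=0.015):
    """Kinesis cost: 25 KB put units plus shard-hours (1 shard per 1 MB/s)."""
    units_per_gb = KB_PER_GB / unit_kb
    put_cost = gb_per_day * DAYS * units_per_gb * price_per_million / 1e6
    mb_per_second = gb_per_day * 1024 / 86_400
    shards = max(1, math.ceil(mb_per_second))
    return put_cost + shards * shard_hour * 24 * DAYS

for gb in (1, 1024):  # 1 GB/day and 1 TB/day
    print(gb, round(sqs_monthly(gb), 2), round(kinesis_monthly(gb), 2))
```

Running it gives roughly $0.20 vs. $10.82 per month at 1 GB per day and $201.33 vs. $158.44 at 1 TB per day, matching the totals in the text.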

Kinesis solves the map part of a typical map-reduce scenario for streaming data, whereas SQS does not provide that guarantee. If you have streaming data that needs to be aggregated on a key, Kinesis makes sure that all the data for that key goes to a specific shard, and the shard can be consumed on a single host, which makes aggregation on the key easier than with SQS.
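A small sketch of that routing: Kinesis hashes the partition key with MD5 into a 128-bit integer and each shard owns a range of that hash space. The sketch assumes the shards split the space evenly (real streams can have uneven ranges after resharding), and the keys and amounts are made up.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2 ** 128  # size of the MD5 hash key space

def shard_for(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hashed = int.from_bytes(digest, "big")
    return hashed * num_shards // HASH_SPACE

records = [("user-1", 5), ("user-2", 1), ("user-1", 7), ("user-3", 2), ("user-1", 1)]

# Each shard keeps its own per-key totals, as a single host per shard would.
totals_per_shard = {}
for key, amount in records:
    totals_per_shard.setdefault(shard_for(key, 2), Counter())[key] += amount

user_1_shard = shard_for("user-1", 2)
print(totals_per_shard[user_1_shard]["user-1"])  # 13
```

Since every "user-1" record lands on the same shard, the host reading that shard sees the full total without coordinating with any other host.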

Kinesis Use Cases
Log and Event Data Collection
Real-time Analytics
Mobile Data Capture
“Internet of Things” Data Feed
SQS Use Cases
Application integration
Decoupling microservices
Allocate tasks to multiple worker nodes
Decouple live user requests from intensive background work
Batch messages for future processing

I'll add one more thing nobody else has mentioned -- SQS is several orders of magnitude more expensive.

In very simple terms, and keeping costs out of the picture, the real intention of SNS-SQS is to make services loosely coupled. That is the primary reason to use SQS: when the order of the messages is not so important and you want more control over individual messages. If you want a job-queue pattern, SQS is again much better. Kinesis shouldn't be used in such cases, because it is difficult to remove messages from Kinesis and because Kinesis replays the whole batch on error. You can also use SQS as a dead letter queue for more control. With Kinesis all of this is possible but rarely done, unless you have serious objections to SQS.
If you want proper partitioning, then SQS won't be useful.
