Kafka message codec - compress and decompress - compression

When using kafka, I can set a codec by setting the kafka.compression.codec property of my kafka producer.
Suppose I use snappy compression in my producer, when consuming the messages from kafka using some kafka-consumer, should I do something to decode the data from snappy or is it some built-in feature of kafka consumer?
In the relevant documentation I could not find any property that relates to encoding in kafka consumer (it only relates to the producer).
Can someone clear this?

As per my understanding goes the de-compression is taken care by the Consumer it self. As mentioned in their official wiki page
The consumer iterator transparently decompresses compressed data and only returns an uncompressed message
As found in this article the way consumer works is as follows
The consumer has background “fetcher” threads that continuously fetch data in batches of 1MB from the brokers and add it to an internal blocking queue. The consumer thread dequeues data from this blocking queue, decompresses and iterates through the messages
And also in the doc page under End-to-end Batch Compression its written that
A batch of messages can be clumped together compressed and sent to the server in this form. This batch of messages will be written in compressed form and will remain compressed in the log and will only be decompressed by the consumer.
So it appears that the decompression part is handled in the consumer it self all you need to do is to provide the valid / supported compression type using the compression.codec ProducerConfig attribute while creating the producer. I couldn't find any example or explanation where it says any approach for decompression in the consumer end. Please correct me if I am wrong.

I have the same issue with v0.8.1 and this compression decomression in Kafka is poorly documented other than saying the Consumer should "transparently" decompresses compressed data which it NEVER did.
The example high level consumer client using ConsumerIterator in Kafka web site only works with uncompressed data. Once I enable compression in Producer client, the message never gets into the following "while" loop. Hopefully they should fix this issue asap or they shouldn't claim this feature as some users may use Kafka to transport large size message that needs batching and compression capabilities.
ConsumerIterator <byte[], byte[]> it = stream.iterator();
while(it.hasNext())
{
String message = new String(it.next().message());
}

Related

Kinesis Producer callback functions - guaranteed delivery?

Streaming to Kinesis billions of messages a day.
We're looking for an implementation that would allow us to deliver messages to Kinesis with exactly-once guarantee.
Our producer framework requires a streaming sink to be idempotent for exactly-once delivery guarantee, which Kinesis is not. So we're getting at-least once deliveries currently. (duplicates are possible and we do see them, when a streaming micro-batch has to restart for whatever reason on the producer side)
We started looking at Kinesis Producer Library (KPL) callback functions. Basically we would be tracking state of what messages were delivered and what not in DynamoDB based on a key that's present in each message. And if we know that a message was already sent, we will skip it for delivery re-attempt. Then it seems exactly-once is possible.. with two concerns:
1)
The only question we have - how likely it is we would lose a invocation of the callback function (e.g. network glitch etc), or the callback function itself has failed (e.g. we ran into a DynamoDB limit/ outage etc) -- is this documented somewhere? I know the chances are not high, but we want to design a system that would be resilient to some expected things like these.
2)
Timing. Let's say if for whatever reason Kinesis invoked a callback function with a delay (5-15 milliseconds would be enough to break some assumptions in the above callback functions that persists delivery state in DynamoDB). And while we haven't received a confirmation on the delivery, our streaming producer framework has attempted redelivery that it thinks wasn't yet delivery. Any workarounds for this potential issue?
ps. We know that one way to workaround, is to make dedups on an application side (receiver from that kinesis stream), but that's outside of our project and we have a hard requirement to get exactly-once into that Kinesis stream.
For #1, any path you go down you'll find yourself in edge cases that could lead you to loss of data, or duplicate calls. Even using a two phased commit protocol doesn't work here if the consumer isn't participating in that protocol.
For #2, Kinesis is ordered, so if you do get duplicates you should be able to reliably assume they will be on the same shard, and thus not processed while another reader is still processing (assuming one reader per shard). Just make sure you are using a strongly consistent read when calling DynamoDB.

What is the difference between Kinesis and SQS?

I know there is a lot materials online for this question, however I have not found any that can explain this question quite clearly to a rookie like me... Appreciate it if some one can help me understand the key differences between these two services and use cases with real life examples. Thank you!
Amazon SQS is a queue. The basic process is:
Messages are sent to the queue. They stay there for up to 14 days.
Worker programs can request a message (or up to 10 messages) from the queue.
When a message is retrieved from the queue:
It stays in the queue but is marked as invisible
When the worker has finished processing the message, it tells SQS to delete the message from the queue
If the worker does not delete the message within the queue's invisibility timeout period, then the message reappears on the queue for another worker to process
The worker can, if desired, periodically tell SQS to keep a message invisible because it is still being processed
Thus, once a message is processed, it is deleted.
In Amazon Kinesis, a message is sent to a stream. The stream is divided into shards (think of them as mini-streams). When a message is received, Kinesis stores the message in sequential order. Then, workers can request a message from the start of the stream, or from a specific spot in the stream. For example, if it has already processed 5 messages, it can ask for the 6th message. The messages are retained in the stream for a period of time (eg 24 hours).
I like to think of it like a film strip — each frame in a film is kept in order. You can play a film from the start, or you can fast-forward to the middle and start playing from there. In addition, you can rewind to an earlier part and watch it. The same is true for a Kinesis stream, and multiple consumers can read from various parts of the stream simultaneously.
So, which to choose?
If a message is used once and then discarded, a queue is probably the better choice.
If retaining message order is important and/or messages will be used more than once, then a stream is probably better.
This article sums it up pretty nicely, imo:
https://sookocheff.com/post/aws/comparing-kinesis-and-sqs/
but basically, if you don't know which one you need, start with SQS until it can't do what you want. SQS is dead-simple to setup and use, and requires almost no experise to use it well.
Kinesis takes a lot more time and expertise to setup to use, so unless you need it, don't bother - even though it could be used for many of the same things as SQS.
One big difference, with SQS if you have multiple consumers reading from the queue, than each consumer will only ever see thge messages they consume - because other consumers will be blocked from seeing them; with Kinesis, many consumers can access the stream at the same time, and each consumer sees the entire streem - so SQS is good for taking a large number of tasks and doling out pieces to lots of consumers to work on in parallel (among other things), where as with Kinesis multiple consumers could read and see the entire streem and do something with ALL of the data in the stream.
The linked article explains it better than me.
I try to give a simple answer based on my practical experience:
Consider SQS as temporary storage service. Use cases:
manage data with different queue priorities
store data for a limited period of time
Lambda DLQ
reduce costs with long polling
create a FIFO
Consider Kinesis as a collector of large stream of real-time data. Use cases:
very very large stream of data from different sources
backup of data just enabling Firehose (you get a data lake for free)
get statistics at once during the collecting phase integrating Kinesis Analytics
have checkpoints to keep track in DynamoDB of records processed/failed
Note: consider that both services can be integrated with Lambda Functions very easily, so there are a plenty of use cases that can be solved both with SQS and Kinesis. Anyway, I tried to list some use cases where I found that one of the two performed peculiarly better than the other. Hope it can be helpful :)

Multiple different consumers of same Kinesis stream

I have a Kinesis producer which writes a single type of message to a stream. I want to process this stream in multiple, completely different consumer applications. So, a pub/sub with a single publisher for a given topic/stream. I also want to make use of checkpointing to ensure that each consumer processes every message written to the stream.
Initially, I was using the same App Name for all consumers and producers. However, I started getting the following error once I started more than one consumer:
com.amazonaws.services.kinesis.model.InvalidArgumentException: StartingSequenceNumber 49564236296344566565977952725717230439257668853369405442 used in GetShardIterator on shard shardId-000000000000 in stream PackageCreated under account ************ is invalid because it did not come from this stream. (Service: AmazonKinesis; Status Code: 400; Error Code: InvalidArgumentException; Request ID: ..)
This seems to be because consumers are clashing with their checkpointing as they are using the same App Name.
From reading the documentation, it seems the only way to do pub/sub with checkpointing is by having a stream per consumer application, which requires each producer to know about all possible consumers. This is more tightly coupled than I want; it's really just a queue.
It seems like Kafka supports what I want: arbitrary consumption of a given topic/partition, since consumers are completely in control of their own checkpointing. Is my only option to move to Kafka, or some other alternative, if I want pub/sub with checkpointing?
My RecordProcessor code, which is identical in each consumer:
override def processRecords(processRecordsInput: ProcessRecordsInput): Unit = {
log.trace("Received record(s) from kinesis")
for {
record <- processRecordsInput.getRecords
json <- jawn.parseByteBuffer(record.getData).toOption
msg <- decode[T](json.toString).toOption
} yield subscriber ! msg
processRecordsInput.getCheckpointer.checkpoint()
}
The code parses the message and sends it off to the subscriber. For now, I'm simply marking all messages as successfully received. I can see messages being sent on the AWS Kinesis dashboard, but no reads happen, presumably because each application has its own AppName and doesn't see any other messages.
The pattern you want, that of one publisher to & multiple consumers from one Kinesis stream, is supported. You don't need a separate stream per consumer.
How do you do that? You need to give a different application-name to every consumer. That way, checkpointing info of one consumer won't collide with that of another.
Check the first response to this: https://forums.aws.amazon.com/message.jspa?messageID=554375

Kinesis stream pending message count

I am trying to use AWS Kinesis stream for one of our data streams. I would like to monitor pending messages on my stream for ops purposes(scale downstream according to backlog), but unable to find any API that gives (approx) pending messages in my stream.
This looks strange as messages get expired after 7 days and if the producers and consumers are isolated and can't communicate, how do you know messages are expiring. How do you handle this problem?
Thanks!
There is no such concept as "pending" message in Kinesis. All the incoming data will be placed on a shard.
Your consumer application should be in running state all the time, to keep track of changes in your stream. The application (with the help of KCL) will continue to poll "Shard Iterator" in the background, thus you will be notified about the new data when it comes.
Roughly; you can see Kinesis as a FIFO queue and the messages will disappear in a short time if you don't pop them.
If your application will process a few messages in an hour, you should think about changing your architecture. Kinesis is probably not the correct tool for you.

How can i block writes on a pipeline for reconfiguration of the pipeline?

Im writing a server that regularly needs to change the format of the send/received messages. when this happens the server should send a notification that all future messages have the new format and read all received in the old format until the client sends his ack.
i thought about keeping a reference to the decoder shared by all pipelines and reconfigure it from the outside as needed. I'm worried about concurrency in this case.
how can i make sure that no writes are handled by the pipeline while
i'm working on the decoder?
and how to be sure that the notification
is the first message handled after reconfiguration?
the only other way i see is to send a "notification" object through the pipeline (by using channel.write), catch the object in the decoder and do the reconfig then while forwarding the notification message. In this case there shouldn't be any concurrency in the pipeline.
would this be the better/state of the art way to do this?
i decided to use the second way. A StateHandler catches ConfigurationEvents reconfigures the pipeline. Unfortunately this means that i can't be sure that all channels use the same configuration because race conditions between the reconfiguration and extremely young channels can happen. but i'm pretty sure this won't matter in my case.