I am trying to use AWS Kinesis stream for one of our data streams. I would like to monitor pending messages on my stream for ops purposes(scale downstream according to backlog), but unable to find any API that gives (approx) pending messages in my stream.
This looks strange as messages get expired after 7 days and if the producers and consumers are isolated and can't communicate, how do you know messages are expiring. How do you handle this problem?
Thanks!
There is no such concept as "pending" message in Kinesis. All the incoming data will be placed on a shard.
Your consumer application should be in running state all the time, to keep track of changes in your stream. The application (with the help of KCL) will continue to poll "Shard Iterator" in the background, thus you will be notified about the new data when it comes.
Roughly; you can see Kinesis as a FIFO queue and the messages will disappear in a short time if you don't pop them.
If your application will process a few messages in an hour, you should think about changing your architecture. Kinesis is probably not the correct tool for you.
Related
I know there is a lot materials online for this question, however I have not found any that can explain this question quite clearly to a rookie like me... Appreciate it if some one can help me understand the key differences between these two services and use cases with real life examples. Thank you!
Amazon SQS is a queue. The basic process is:
Messages are sent to the queue. They stay there for up to 14 days.
Worker programs can request a message (or up to 10 messages) from the queue.
When a message is retrieved from the queue:
It stays in the queue but is marked as invisible
When the worker has finished processing the message, it tells SQS to delete the message from the queue
If the worker does not delete the message within the queue's invisibility timeout period, then the message reappears on the queue for another worker to process
The worker can, if desired, periodically tell SQS to keep a message invisible because it is still being processed
Thus, once a message is processed, it is deleted.
In Amazon Kinesis, a message is sent to a stream. The stream is divided into shards (think of them as mini-streams). When a message is received, Kinesis stores the message in sequential order. Then, workers can request a message from the start of the stream, or from a specific spot in the stream. For example, if it has already processed 5 messages, it can ask for the 6th message. The messages are retained in the stream for a period of time (eg 24 hours).
I like to think of it like a film strip — each frame in a film is kept in order. You can play a film from the start, or you can fast-forward to the middle and start playing from there. In addition, you can rewind to an earlier part and watch it. The same is true for a Kinesis stream, and multiple consumers can read from various parts of the stream simultaneously.
So, which to choose?
If a message is used once and then discarded, a queue is probably the better choice.
If retaining message order is important and/or messages will be used more than once, then a stream is probably better.
This article sums it up pretty nicely, imo:
https://sookocheff.com/post/aws/comparing-kinesis-and-sqs/
but basically, if you don't know which one you need, start with SQS until it can't do what you want. SQS is dead-simple to setup and use, and requires almost no experise to use it well.
Kinesis takes a lot more time and expertise to setup to use, so unless you need it, don't bother - even though it could be used for many of the same things as SQS.
One big difference, with SQS if you have multiple consumers reading from the queue, than each consumer will only ever see thge messages they consume - because other consumers will be blocked from seeing them; with Kinesis, many consumers can access the stream at the same time, and each consumer sees the entire streem - so SQS is good for taking a large number of tasks and doling out pieces to lots of consumers to work on in parallel (among other things), where as with Kinesis multiple consumers could read and see the entire streem and do something with ALL of the data in the stream.
The linked article explains it better than me.
I try to give a simple answer based on my practical experience:
Consider SQS as temporary storage service. Use cases:
manage data with different queue priorities
store data for a limited period of time
Lambda DLQ
reduce costs with long polling
create a FIFO
Consider Kinesis as a collector of large stream of real-time data. Use cases:
very very large stream of data from different sources
backup of data just enabling Firehose (you get a data lake for free)
get statistics at once during the collecting phase integrating Kinesis Analytics
have checkpoints to keep track in DynamoDB of records processed/failed
Note: consider that both services can be integrated with Lambda Functions very easily, so there are a plenty of use cases that can be solved both with SQS and Kinesis. Anyway, I tried to list some use cases where I found that one of the two performed peculiarly better than the other. Hope it can be helpful :)
I have one Node.js application that pool messages from an AWS SQS queue and process these messages.
However some of these messages are not relevant to my service and they will be filtered out and not processed.
I am wondering if I can do this filtering stuffs from AWS before receiving these irrelevant messages ...
For example if the message does not have the following attribute data.name than this message will be deleted before reaching my application ...
Filtering these messages before sending them to to the queue is not possible (according to my client).
No that is not possible without polling the message itself. So you would need some other consumer polling the messages and returning them to queue (not calling DeleteMessage on the received delete handle) if they meet your requirements but that would be overkill in most of the cases, depending on the ratio of "good" and "bad" messages but still you would have to process "good" messages twice.
Better way would be to set up additional consumer and 2 queues. Producer sends messages to the first queue which is polled by the first consumer whose sole purpose is to filter messages and to send "good" messages to the second queue which would then be polled by your current consumer application. But again, this is much more costly.
If you can't filter messages before sending them to queue then filter them in your consuming application or you will have to pay some extra for this extra functionality.
I would like to publish a message on SQS and process that message after a few hours.
How can i schedule a message delivery or select messages from SQS based on some attribute?
I've implemented a SQS consumer but I'm receiving every message from SQS queue. Is possible to implement something like that on SQS? I was thinking about to receive every message and send to queue again if it's not time to process that message.
There is a feature called as Delay Queues in SQS, wherein if you set the delay on the queue then any message that is put on queue is available to consumers only after the delay duration has elapsed. However, the maximum delay that you can set there is 15 minutes and if you are looking for a delay of few hours this may not directly work for you.
The other option is to set a visibilty timeout for the messages higher than the delay time that you want. Then when you read the message you can get the message timestamp. If there is still some time left for your delay then you can sleep your consumer for the remaining time and after it has woken up you can process that message. However this is not a recommended way and would be highly inefficient because your threads are getting blocked. In fact what can as well be done is if there is still some time left for your delay then you just hold the message in a local List/Array and check for other messages and process this message after your delay. But all this would require entire logic to reside in your code and you don't get any ready-made feature from AWS
Is it possible to setup my SQS queue on AWS in order to process only once my message?
Maybe tweaking on long/short polling (is it going to have any impact on processing only once?)
or visibilityTimeout seconds,
or taking some best practice on my workers' application?
Or should I move definitely to a FIFO queue to be sure I have granted only once processing?
SQS will definitely process the message at least once but there a chance to process message more than once. Say you have a visibility timeout of 30 seconds and the consumer took 35 seconds to process the message then the message will again be available in the queue for other processes. If you don't have a problem with duplicate messages and expecting high throughput then SQS standard would be the right choice. Even you tweak with short polling or long polling you cannot guarantee that you can avoid duplication with SQS standard.
If you need to process message exactly once and if you strictly don't need any duplication then FIFO would be the right choice. Keep in mind throughput of FIFO wouldn't be that high as SQS standard. FIFO queues can support up to 300 messages per second
FIFO queues are designed to never introduce duplicate messages. However, your message producer might introduce duplicates in certain scenarios: for example, if the producer sends a message, does not receive a response, and then resends the same message. Amazon SQS APIs provide deduplication functionality that prevents your message producer from sending duplicates. Any duplicates introduced by the message producer are removed within a 5-minute deduplication interval.
Please read more about SQS standard here
Please read more about SQS FIFO here
I have a Kinesis producer which writes a single type of message to a stream. I want to process this stream in multiple, completely different consumer applications. So, a pub/sub with a single publisher for a given topic/stream. I also want to make use of checkpointing to ensure that each consumer processes every message written to the stream.
Initially, I was using the same App Name for all consumers and producers. However, I started getting the following error once I started more than one consumer:
com.amazonaws.services.kinesis.model.InvalidArgumentException: StartingSequenceNumber 49564236296344566565977952725717230439257668853369405442 used in GetShardIterator on shard shardId-000000000000 in stream PackageCreated under account ************ is invalid because it did not come from this stream. (Service: AmazonKinesis; Status Code: 400; Error Code: InvalidArgumentException; Request ID: ..)
This seems to be because consumers are clashing with their checkpointing as they are using the same App Name.
From reading the documentation, it seems the only way to do pub/sub with checkpointing is by having a stream per consumer application, which requires each producer to know about all possible consumers. This is more tightly coupled than I want; it's really just a queue.
It seems like Kafka supports what I want: arbitrary consumption of a given topic/partition, since consumers are completely in control of their own checkpointing. Is my only option to move to Kafka, or some other alternative, if I want pub/sub with checkpointing?
My RecordProcessor code, which is identical in each consumer:
override def processRecords(processRecordsInput: ProcessRecordsInput): Unit = {
log.trace("Received record(s) from kinesis")
for {
record <- processRecordsInput.getRecords
json <- jawn.parseByteBuffer(record.getData).toOption
msg <- decode[T](json.toString).toOption
} yield subscriber ! msg
processRecordsInput.getCheckpointer.checkpoint()
}
The code parses the message and sends it off to the subscriber. For now, I'm simply marking all messages as successfully received. I can see messages being sent on the AWS Kinesis dashboard, but no reads happen, presumably because each application has its own AppName and doesn't see any other messages.
The pattern you want, that of one publisher to & multiple consumers from one Kinesis stream, is supported. You don't need a separate stream per consumer.
How do you do that? You need to give a different application-name to every consumer. That way, checkpointing info of one consumer won't collide with that of another.
Check the first response to this: https://forums.aws.amazon.com/message.jspa?messageID=554375