Multiple different consumers of same Kinesis stream - amazon-web-services

I have a Kinesis producer which writes a single type of message to a stream. I want to process this stream in multiple, completely different consumer applications. So, a pub/sub with a single publisher for a given topic/stream. I also want to make use of checkpointing to ensure that each consumer processes every message written to the stream.
Initially, I was using the same App Name for all consumers and producers. However, I started getting the following error once I started more than one consumer:
com.amazonaws.services.kinesis.model.InvalidArgumentException: StartingSequenceNumber 49564236296344566565977952725717230439257668853369405442 used in GetShardIterator on shard shardId-000000000000 in stream PackageCreated under account ************ is invalid because it did not come from this stream. (Service: AmazonKinesis; Status Code: 400; Error Code: InvalidArgumentException; Request ID: ..)
This seems to be because consumers are clashing with their checkpointing as they are using the same App Name.
From reading the documentation, it seems the only way to do pub/sub with checkpointing is by having a stream per consumer application, which requires each producer to know about all possible consumers. This is more tightly coupled than I want; it's really just a queue.
It seems like Kafka supports what I want: arbitrary consumption of a given topic/partition, since consumers are completely in control of their own checkpointing. Is my only option to move to Kafka, or some other alternative, if I want pub/sub with checkpointing?
My RecordProcessor code, which is identical in each consumer:
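override def processRecords(processRecordsInput: ProcessRecordsInput): Unit = {
  log.trace("Received record(s) from kinesis")
  // Parse each record as JSON, decode it to T and forward it; records that fail to parse or decode are skipped silently.
  for {
    record <- processRecordsInput.getRecords
    json <- jawn.parseByteBuffer(record.getData).toOption
    msg <- decode[T](json.toString).toOption
  } yield subscriber ! msg
  // Checkpoint the whole batch, including any records skipped above.
  processRecordsInput.getCheckpointer.checkpoint()
}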
The code parses the message and sends it off to the subscriber. For now, I'm simply marking all messages as successfully received. I can see messages being sent on the AWS Kinesis dashboard, but no reads happen, presumably because each application has its own AppName and doesn't see any other messages.

The pattern you want, one publisher and multiple consumers on a single Kinesis stream, is supported. You don't need a separate stream per consumer.
How do you do that? Give every consumer application a different application name. The KCL stores its checkpoints in a DynamoDB table named after the application name, so with distinct names the checkpointing info of one consumer won't collide with that of another.
Check the first response to this: https://forums.aws.amazon.com/message.jspa?messageID=554375
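To make the idea concrete, here is a minimal in-memory sketch in plain Scala (standard library only). It is not the KCL API: CheckpointSketch, poll and the application names are hypothetical, and the record index stands in for a sequence number. It only illustrates why keying checkpoints by application name lets every application see every record.

```scala
import scala.collection.mutable

// In-memory stand-in for the checkpoint table; NOT the real KCL API.
object CheckpointSketch {
  // Records in one shard, in arrival order; the index stands in for a sequence number.
  val shard: Vector[String] = Vector("pkg-1", "pkg-2", "pkg-3")

  // Checkpoints keyed by (application name, shard id), so applications never share a position.
  private val checkpoints = mutable.Map.empty[(String, String), Int]

  // Return every record after this application's checkpoint, then advance the checkpoint.
  def poll(appName: String, shardId: String = "shardId-000000000000"): Vector[String] = {
    val from = checkpoints.getOrElse((appName, shardId), 0)
    val batch = shard.drop(from)
    checkpoints.update((appName, shardId), shard.length)
    batch
  }
}
```

Polling as "billing-app" and then as "audit-app" returns all three records both times, while a second poll as "billing-app" returns nothing, because only that application's checkpoint has moved.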

Related

AWS SQS Selective Polling Pattern

I have a system where I publish updates to a shared topic meant for specific consumers.
I noticed messages getting stuck in the queue because SQS consumers cannot listen selectively, so messages are being hijacked.
Example:
Given: Message{destination: A, payload: 1234}
Given: ConsumerA, & ConsumerB
I expect Message to be processed by ConsumerA. However, it keeps getting hijacked by ConsumerB, which receives the message and then refuses to process it because the destination field doesn't match. The visibility timeout then expires and the message is put back on the queue, but due to the nature of SQS, ConsumerB has an equal chance of picking up the message again.
My question is, what patterns are used to solve this type of issue?
I'm considering creating a queue per consumer, but it has drawbacks specific to the system I'm working on.
If I could only listen for messages with matching attributes, problem solved, but that's seemingly not the case.
Is there any other way?
Sharing a single Amazon SQS queue is not an appropriate architecture for your use-case.
If you want your consumers to be able to 'request' a message from a particular subset, you should either use separate SQS queues or use a database. You could even store objects in Amazon S3 as a form of NoSQL database.
Having consumers grab messages and then 'send them back' to the queue is not compatible with the design of the Amazon SQS service.
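As a rough illustration of the separate-queues option, the sketch below routes each message to a queue chosen by its destination field, so a consumer never receives a message it has to hand back. It is an in-memory model in plain Scala, not the SQS API; PerConsumerQueues, send and receive are hypothetical names.

```scala
import scala.collection.mutable

object PerConsumerQueues {
  final case class Message(destination: String, payload: String)

  // One queue per destination; the map key stands in for a (hypothetical) queue name.
  private val queues = mutable.Map.empty[String, mutable.Queue[Message]]

  // The producer picks the queue from the destination field, so no consumer sees foreign messages.
  def send(msg: Message): Unit =
    queues.getOrElseUpdate(msg.destination, mutable.Queue.empty[Message]).enqueue(msg)

  // A consumer polls only its own queue.
  def receive(consumer: String): Option[Message] =
    queues.get(consumer).flatMap(q => if (q.isEmpty) None else Some(q.dequeue()))
}
```

With this layout, a message sent with destination "A" is only ever returned by receive("A"); receive("B") simply finds nothing.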

What is the difference between Kinesis and SQS?

I know there is a lot of material online on this question, but I have not found any that explains it clearly to a rookie like me. I'd appreciate it if someone could help me understand the key differences between these two services, and their use cases, with real-life examples. Thank you!
Amazon SQS is a queue. The basic process is:
Messages are sent to the queue. They stay there for up to 14 days.
Worker programs can request a message (or up to 10 messages) from the queue.
When a message is retrieved from the queue:
It stays in the queue but is marked as invisible
When the worker has finished processing the message, it tells SQS to delete the message from the queue
If the worker does not delete the message within the queue's visibility timeout period, then the message reappears on the queue for another worker to process
The worker can, if desired, periodically tell SQS to keep a message invisible because it is still being processed
Thus, once a message is processed, it is deleted.
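The receive/delete cycle above can be sketched with a toy in-memory queue. This is only a model in plain Scala, not the SQS API: ToyQueue is a hypothetical name, the integer id stands in for a receipt handle, and time is passed in explicitly instead of being read from a clock.

```scala
import scala.collection.mutable

// Toy model of the receive/delete cycle; `now` is supplied by the caller, in seconds.
final class ToyQueue(visibilityTimeoutSec: Long) {
  private final case class Entry(body: String, var invisibleUntil: Long)
  private val entries = mutable.LinkedHashMap.empty[Int, Entry]
  private var nextId = 0

  def send(body: String): Unit = { entries.update(nextId, Entry(body, 0L)); nextId += 1 }

  // Hand out the first visible message and hide it for the timeout; returns (id, body).
  def receive(now: Long): Option[(Int, String)] =
    entries.find { case (_, e) => e.invisibleUntil <= now }.map { case (id, e) =>
      e.invisibleUntil = now + visibilityTimeoutSec
      (id, e.body)
    }

  // The worker deletes the message once processing has succeeded.
  def delete(id: Int): Unit = entries.remove(id)
}
```

A message received at time 0 with a 30-second timeout is hidden from a second receive at time 10, becomes visible again at time 30 if it was not deleted, and is gone for good after delete.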
In Amazon Kinesis, a message is sent to a stream. The stream is divided into shards (think of them as mini-streams). When a message is received, Kinesis stores it in sequential order. Workers can then request messages from the start of the stream, or from a specific spot in the stream. For example, a worker that has already processed 5 messages can ask for the 6th. The messages are retained in the stream for a period of time (e.g. 24 hours).
I like to think of it like a film strip — each frame in a film is kept in order. You can play a film from the start, or you can fast-forward to the middle and start playing from there. In addition, you can rewind to an earlier part and watch it. The same is true for a Kinesis stream, and multiple consumers can read from various parts of the stream simultaneously.
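The film-strip picture can be sketched as an append-only log whose readers pick their own position. This is a toy model in plain Scala, not the Kinesis API; ToyShard is a hypothetical name and the integer position stands in for a sequence number. Retention is left out for brevity.

```scala
// Toy shard: an append-only log where readers choose their own starting position.
final class ToyShard {
  private var records = Vector.empty[String]

  // Append a record and return its position (standing in for a sequence number).
  def put(data: String): Int = { records = records :+ data; records.length - 1 }

  // Read up to `limit` records starting at `position`; reading removes nothing.
  def read(position: Int, limit: Int = 10): Vector[String] =
    records.slice(position, position + limit)
}
```

Because reading does not remove anything, one consumer can start at position 0 while another starts in the middle, and either can go back and read the same records again.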
So, which to choose?
If a message is used once and then discarded, a queue is probably the better choice.
If retaining message order is important and/or messages will be used more than once, then a stream is probably better.
This article sums it up pretty nicely, imo:
https://sookocheff.com/post/aws/comparing-kinesis-and-sqs/
but basically, if you don't know which one you need, start with SQS until it can't do what you want. SQS is dead simple to set up and use, and requires almost no expertise to use well.
Kinesis takes a lot more time and expertise to set up and use, so unless you need it, don't bother, even though it could be used for many of the same things as SQS.
One big difference: with SQS, if you have multiple consumers reading from the queue, then each consumer will only ever see the messages it consumes, because other consumers are blocked from seeing them. With Kinesis, many consumers can access the stream at the same time, and each consumer sees the entire stream. So SQS is good for taking a large number of tasks and doling out pieces to lots of consumers to work on in parallel (among other things), whereas with Kinesis multiple consumers can each read the entire stream and do something with ALL of the data in it.
The linked article explains it better than me.
I'll try to give a simple answer based on my practical experience:
Consider SQS as temporary storage service. Use cases:
manage data with different queue priorities
store data for a limited period of time
Lambda DLQ
reduce costs with long polling
create a FIFO
Consider Kinesis as a collector for large streams of real-time data. Use cases:
very large streams of data from different sources
backup of data by just enabling Firehose (you get a data lake for free)
get statistics during the collection phase by integrating Kinesis Analytics
have checkpoints in DynamoDB to keep track of records processed/failed
Note: both services can be integrated with Lambda functions very easily, so there are plenty of use cases that can be solved with either SQS or Kinesis. Anyway, I tried to list some use cases where I found that one of the two performed particularly better than the other. Hope it can be helpful :)

Single SQS Queue vs Multiple SQS Queue while creating a Async Model

I have to develop a component where the APIs are async in nature. To build this async model, I am going to use AWS SQS queues for publishing messages; the client will read from the queue and send the response back into a queue. There are 10 APIs (currently) that I have to expose.
Currently, I can think of having a single request and a single response queue (which I will poll) for all the APIs and the payload of the APIs can be defined by some Operation.
The other way is to use a separate queue for each API. The advantage that I can see for multiple queues is that each API can have different traffic and having multiple queues can help the client of the queues to scale effectively.
What can be other pros or cons for both the approaches?
Separate your use-case into 2 distinct problems:
Problem 1: APIs to Workers, one queue or multiple?
If your workers do different types of work, then having a single queue will require them to inspect then discard messages they don't care about. If this is the case, then you should have one queue per message type. This way, any message a worker receives from the queue, it should be able to handle.
If you start ignoring messages, then other workers, who may be idle, may wait a while for the messages they care about.
Problem 2: Using a return queue for the "results". If your clients will be polling for results, then at each poll, your API will need to poll the queue. Again, it will be "searching" for the right response, discarding those it doesn't care about, starving other clients.
Recommendation:
Use multiple queues, one per "worker type". A worker should be able to process any message it receives from its queue.
Then use something other than SQS to store the result. One option is to use S3 to store the result:
When your API "creates" the task, create an object in S3 and put a reference to that S3 object on your SQS queue.
Your worker will do the work, then put the result where it was told to.
When your client polls your API for the result, your API will check S3 and return the status/results.
Instead of S3, other data stores could be used if appropriate: RDS, DynamoDB, etc.
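The recommended flow can be sketched as follows. This is an in-memory model in plain Scala, with a queue standing in for SQS and a map standing in for S3 (or any other data store); AsyncResultSketch, the "results/..." key format and the PENDING/DONE markers are all hypothetical choices for illustration.

```scala
import scala.collection.mutable

object AsyncResultSketch {
  // Stand-ins for an SQS queue and an S3 bucket; keys and names are hypothetical.
  final case class Task(operation: String, payload: String, resultKey: String)
  private val workQueue = mutable.Queue.empty[Task]
  private val resultStore = mutable.Map.empty[String, String]

  // API side: reserve a result location, then enqueue a task that references it.
  def submit(operation: String, payload: String): String = {
    val key = s"results/${java.util.UUID.randomUUID()}"
    resultStore.update(key, "PENDING")
    workQueue.enqueue(Task(operation, payload, key))
    key
  }

  // Worker side: take one task, do the (dummy) work, write the result where the task says.
  def workOnce(): Unit =
    if (workQueue.nonEmpty) {
      val task = workQueue.dequeue()
      resultStore.update(task.resultKey, s"DONE:${task.payload.toUpperCase}")
    }

  // API side: the client polls by key, so no queue has to be searched for the right response.
  def status(key: String): Option[String] = resultStore.get(key)
}
```

The point of the design is that the result lookup is a direct read by key, so one client polling for its result never touches, delays or discards another client's response.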

Kinesis stream pending message count

I am trying to use an AWS Kinesis stream for one of our data streams. I would like to monitor pending messages on my stream for ops purposes (scaling downstream according to backlog), but I am unable to find any API that gives the (approximate) number of pending messages in my stream.
This looks strange, as messages expire after 7 days, and if the producers and consumers are isolated and can't communicate, how do you know messages are expiring? How do you handle this problem?
Thanks!
There is no such concept as a "pending" message in Kinesis. All incoming data is placed on a shard.
Your consumer application should be running all the time to keep track of changes in your stream. The application (with the help of the KCL) will keep polling the stream with a shard iterator in the background, so you will be notified about new data when it arrives.
Roughly speaking, you can see Kinesis as a FIFO queue whose messages disappear once the retention period has passed, whether or not you have read them.
If your application only processes a few messages an hour, you should think about changing your architecture. Kinesis is probably not the correct tool for you.

Using Amazon SQS with multiple consumers

I have a service-based application that uses Amazon SQS with multiple queues and multiple consumers. I am doing this so that I can implement an event-based architecture and decouple all the services, where the different services react to changes in state of other systems. For example:
Registration Service:
Emits event 'registration-new' when a new user registers.
User Service:
Emits event 'user-updated' when user is updated.
Search Service:
Reads from queue 'registration-new' and indexes user in search.
Reads from queue 'user-updated' and updates user in search.
Metrics Service:
Reads from 'registration-new' queue and sends to Mixpanel.
Reads from queue 'user-updated' and sends to Mixpanel.
I'm having a number of issues:
A message can be received multiple times when doing polling. I can design a lot of the systems to be idempotent, but for some services (such as the metrics service) that would be much more difficult.
A message needs to be manually deleted from the queue in SQS. I have thought of implementing a "message-handling-service" that handles the deletion of messages when all the services have received them (each service would emit a 'message-acknowledged' event after handling a message).
I guess my question is this: what patterns should I use to ensure that I can have multiple consumers for a single queue in SQS, while ensuring that the messages also get delivered and deleted reliably. Thank you for your help.
I think you are doing it wrong.
It looks to me like you are using the same queue to do multiple different things. You are better off using a single queue for a single purpose.
Right now you put an event into the 'registration-new' queue and have two different services poll that queue, with BOTH needing to read that message and do something different with it (and then a 3rd process that is supposed to delete the message after the other 2 have processed it).
One queue should be used for one purpose.
Create an 'index-user-search' queue and a 'send to mixpanels' queue, so the search service reads from the search queue, indexes the user and immediately deletes the message.
The mixpanel service reads from the mixpanel queue, processes the message and deletes the message.
The registration service, instead of emitting a 'registration-new' event to a single queue, now emits it to two queues.
To take it one step further, add SNS into the mix and have the registration service publish an SNS message to the 'registration-new' topic (not queue), and then subscribe both of the queues I mentioned above to that topic in a 'fan-out' pattern.
https://aws.amazon.com/blogs/aws/queues-and-notifications-now-best-friends/
Both queues will receive the message, but you only load it into SNS once. If down the road a 3rd unrelated service also needs to process 'registration-new' events, you create another queue and subscribe it to the topic as well. It can run with no dependencies on, or knowledge of, what the other services are doing. That is the goal.
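A toy version of the fan-out pattern, to show the shape of it: the publisher sends once and every subscribed queue gets its own copy. This is an in-memory model in plain Scala, not the SNS or SQS API; ToyTopic and the queue names used with it are hypothetical.

```scala
import scala.collection.mutable

// Toy topic: every subscribed queue gets its own copy of each published message.
final class ToyTopic {
  private val subscribers = mutable.LinkedHashMap.empty[String, mutable.Queue[String]]

  // Subscribe a named queue and return it so that its consumer can poll it.
  def subscribe(queueName: String): mutable.Queue[String] =
    subscribers.getOrElseUpdate(queueName, mutable.Queue.empty[String])

  // The publisher sends once and does not need to know who is subscribed.
  def publish(message: String): Unit =
    subscribers.values.foreach(_.enqueue(message))
}
```

Each service then consumes and deletes from its own queue, and adding a third subscriber later requires no change on the publishing side. A queue subscribed later only sees messages published after it subscribed.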
The primary use-case for multiple consumers of a queue is scaling-out.
The mechanism that allows for multiple consumers is the Visibility Timeout, which gives a consumer time to process and delete a message without it being consumed concurrently by another consumer.
To address the "At-Least-Once Delivery" property of Standard Queues, the consuming service should be idempotent.
If that isn't possible, one possible solution is to use FIFO queues, but this mode has a limited message delivery rate and a FIFO queue cannot be subscribed to a standard SNS topic (fan-out to FIFO queues requires an SNS FIFO topic).
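One common way to make a consumer idempotent is to remember which message IDs have already been handled and skip redeliveries. The sketch below is a minimal in-memory version in plain Scala; IdempotentConsumer is a hypothetical name, and a real service would need to keep the set of seen IDs in a durable store shared by all of its workers.

```scala
import scala.collection.mutable

// Wraps a handler so that redelivered messages (same id) are processed only once.
final class IdempotentConsumer(handler: String => Unit) {
  // Assumption: an in-memory set is enough for the sketch; production needs a durable store.
  private val seen = mutable.Set.empty[String]

  // Returns true if the handler ran, false if this message id was already processed.
  def onMessage(messageId: String, body: String): Boolean =
    if (seen.add(messageId)) { handler(body); true } else false
}
```

For something like the metrics service, this means a duplicate delivery of the same message is acknowledged and deleted without being sent on a second time.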
They even have a tutorial on how to create a fanout scenario using the combo SNS+SQS.
https://aws.amazon.com/getting-started/tutorials/send-fanout-event-notifications/
Too bad it does not support FIFO queues, so you have to be careful to handle out-of-order messages.
It would be nice if they had a consistent hashing solution to have multiple competing consumers while respecting the message order.