Chat bots: ensuring serial processing of messages on a per-conversation basis in clustered environment - concurrency

In the context of writing a Messenger chat bot in a cloud environment, I'm facing some concurrency issues.
Specifically, I would like to ensure that incoming messages from the same conversation are processed one after the other.
As a constraint, I'm processing the messages with workers in a Cloud environment (i.e the worker pool is of variable size and worker instances are potentially short-lived and may crash). Also, low latency is important.
So abstracting a little, my requirements are:
I have a stream of incoming messages
each of these messages has a 'topic key' (the conversation id)
the set of topics is not known ahead-of-time and is virtually infinite
I want to ensure that messages of the same topic are processed serially
on a cluster of potentially ephemeral workers
if possible, I would like reliability guarantees e.g making sure that each message is processed exactly once.
My questions are:
Is there a name for this concurrency scenario?.
Are there technologies (message brokers, coordination services, etc.) which implement this out of the box?
If not, what algorithms can I use to implement this on top of lower-level concurrency tools? (distributed locks, actors, queues, etc.)

I don't know of a widely-accepted name for the scenario, but a common strategy to solve that type of problem is to route your messages so that all messages with the same topic key end up at the same destination. A couple of technologies that will do this for you:
With Apache ActiveMQ, HornetQ, or Apache ActiveMQ Artemis, you could use your topic key as the JMSXGroupId to ensure all messages with the same topic key are processed in-order by the same consumer, with failover
With Apache Kafka, you could use your topic key as the partition key, which will also ensure all messages with the same topic key are processed in-order by the same consumer
Some message broker vendors refer to this requirement as Message Grouping, Sticky Sessions, or Sticky Message Load Balancing.
Another common strategy on messaging systems with weaker delivery/ordering guarantees (like Amazon SQS) is to simply include a sequence number in the message and leave it up to the destination to resequence and request redelivery of missing messages as needed.

I think you can fix this by using a queue and a set. What I can think of is sending every message object in queue and processing it as first in first out. But while adding it in queue add topic name in set and while taking it out for processing remove topic name from set.
So now if you have any topic in set then don't add another message object of same topic in queue.
I hope this will help you. All the best :)

Related

AWS SQS Selective Polling Pattern

I have a system where I publish updates to a shared topic meant for specific consumers.
I noticed messages getting stuck in the queue due to a lack of selective listening in SQS consumers, so messages are being hijacked.
Example:
Given: Message{destination: A, payload: 1234}
Given: ConsumerA, & ConsumerB
I expect Message to be processed by ConsumerA. However, it gets hijacked by Consumer B continuously. It receives the message, then refuses to process it since the destination field doesn't match, leading to the visibility timeout to expire, and the message put back on the queue.. but due to the nature of SQS, ConsumerB has an equal chance of picking the message again.
My question is, what patterns are used to solve this type of issue?
I'm considering creating a queue per consumer but it has drawbacks specific to the system im working on.
If I could only listen for messages with matching attributes, problem solved, but that's seemingly not the case.
Is there any other way?
Sharing a single Amazon SQS queue is not an appropriate architecture for your use-case.
If you want your consumers to be able to 'request' a message from a particular subset, you should either use separate SQS queues or use a database. You could even store objects in Amazon S3 as a form of noSQL database.
Having consumers grab messages and then 'send them back' to the queue is not compatible with the design of the Amazon SQS service.

How to stream events with GCP platform?

I am looking into building a simple solution where producer services push events to a message queue and then have a streaming service make those available through gRPC streaming API.
Cloud Pub/Sub seems well suited for the job however scaling the streaming service means that each copy of that service would need to create its own subscription and delete it before scaling down and that seems unnecessarily complicated and not what the platform was intended for.
On the other hand Kafka seems to work well for something like this but I'd like to avoid having to manage the underlying platform itself and instead leverage the cloud infrastructure.
I should also mention that the reason for having a streaming API is to allow for streaming towards a frontend (who may not have access to the underlying infrastructure)
Is there a better way to go about doing something like this with the GCP platform without going the route of deploying and managing my own infrastructure?
If you essentially want ephemeral subscriptions, then there are a few things you can set on the Subscription object when you create a subscription:
Set the expiration_policy to a smaller duration. When a subscriber is not receiving messages for that time period, the subscription will be deleted. The tradeoff is that if your subscriber is down due to a transient issue that lasts longer than this period, then the subscription will be deleted. By default, the expiration is 31 days. You can set this as low as 1 day. For pull subscribers, the subscribers simply need to stop issuing requests to Cloud Pub/Sub for the timer on their expiration to start. For push subscriptions, the timer starts based on when no messages are successfully delivered to the endpoint. Therefore, if no messages are published or if the endpoint is returning an error for all pushed messages, the timer is in effect.
Reduce the value of message_retention_duration. This is the time period for which messages are kept in the event a subscriber is not receiving messages and acking them. By default, this is 7 days. You can set it as low as 10 minutes. The tradeoff is that if your subscriber disconnects or gets behind in processing messages by more than this duration, messages older than that will be deleted and the subscriber will not see them.
Subscribers that cleanly shut down could probably just call DeleteSubscription themselves so that the subscription goes away immediately, but for ones that shut down unexpectedly, setting these two properties will minimize the time for which the subscription continues to exist and the number of messages (that will never get delivered) that will be retained.
Keep in mind that Cloud Pub/Sub quotas limit one to 10,000 subscriptions per topic and per project. Therefore, if a lot of subscriptions are created and either active or not cleaned up (manually, or automatically after expiration_policy's ttl has passed), then new subscriptions may not be able to be created.
I think your original idea was better than ephemeral subscriptions tbh. I mean it works, but it feels totally unnatural. Depending on what your requirements are. For example, do clients only need to receive messages while they're connected or do they all need to get all messages?
Only While Connected
Your original idea was better imo. What I probably would have done is to create a gRPC stream service that clients could connect to. The implementation is essentially an observer pattern. The consumer will receive a message and then iterate through the subscribers to do a "Send" to all of them. From there, any time a client connects to the service, it just registers itself with that observer collection and unregisters when it disconnects. Horizontal scaling is passive since clients are sticky to whatever instance they've connected to.
Everyone always get the message, if eventually
The concept is similar to the above but the client doesn't implicitly un-register from the observer on disconnect. Instead, it would register and un-register explicitly (through a method/command designed to do so). Modify the 'on disconnected' logic to tell the observer list that the client has gone offline. Then the consumer's broadcast logic is slightly different. Now it iterates through the list and says "if online, then send, else queue", and send the message to a ephemeral queue (that belongs to the client). Then your 'on connect' logic will send all messages that are in queue to the client before informing the consumer that it's back online. Basically an inbox. Setting up ephemeral, self-deleting queues is really easy in most products like RabbitMQ. I think you'll have to do a bit of managing whether or not it's ok to delete a queue though. For example, never delete the queue unless the client explicitly unsubscribes or has been inactive for so long. Fail to do that, and the whole inbox idea falls apart.
The selected answer above is most similar to what I'm subscribing here in that the subscription is the queue. If I did this, then I'd probably implement it as an internal bus instead of an observer (since it would be unnecessary) - You create a consumer on demand for a connecting client that literally just forwards the message. The message consumer subscribes and unsubscribes based on whether or not the client is connected. As Kamal noted, you'll run into problems if your scale exceeds the maximum number of subscriptions allowed by pubsub. If you find yourself in that position, then you can unshackle that constraint by implementing the pattern above. It's basically the same pattern but you shift the responsibility over to your infra where the only constraint is your own resources.
gRPC makes this mechanism pretty easy. Alternatively, for web, if you're on a Microsoft stack, then SignalR makes this pretty easy too. Clients connect to the hub, and you can publish to all connected clients. The consumer pattern here remains mostly the same, but you don't have to implement the observer pattern by hand.
(note: arrows in diagram are in the direction of dependency, not data flow)

How is Google Cloud Pub/Sub avoiding clock skew

I am looking into ways to order list of messages from google cloud pub/sub. The documentation says:
Have a way to determine from all messages it has currently received whether or not there are messages it has not yet received that it needs to process first.
...is possible by using Cloud Monitoring to keep track of the pubsub.googleapis.com/subscription/oldest_unacked_message_age metric. A subscriber would temporarily put all messages in some persistent storage and ack the messages. It would periodically check the oldest unacked message age and check against the publish timestamps of the messages in storage. All messages published before the oldest unacked message are guaranteed to have been received, so those messages can be removed from persistent storage and processed in order.
I tested it locally and this approach seems to be working fine.
I have one gripe with it however, and this is not something easily testable by myself.
This solution relies on server-side assigned (by google) publish_time attribute. How does Google avoid the issues of skewed clocks?
If my producer publishes messages A and then immediately B, how can I be sure that A.publish_time < B.publish_time is true? Especially considering that the same documentation page mentions internal load-balancers in the architecture of the solution. Is Google Pub/Sub using atomic clocks to synchronize time on the very first machines which see messages and enrich those messages with the current time?
There is an implicit assumption in the recommended solution that the clocks on all the servers are synchronized. But the documentation never explains if that is true or how it is achieved so I feel a bit uneasy about the solution. Does it work under very high load?
Notice I am only interested in relative order of confirmed messages published after each other. If two messages are published simultaneously, I don't care about the order of them between each other. It can be A, B or B, A. I only want to make sure that if B is published after A is published, then I can sort them in that order on retrieval.
Is the aforementioned solution only "best-effort" or are there actual guarantees about this behavior?
There are two sides to ordered message delivery: establishing an order of messages on the publish side and having an established order of processing messages on the subscribe side. The document to which you refer is mostly concerned with the latter, particularly when it comes to using oldest_unacked_message_age. When using this method, one can know that if message A has a publish timestamp that is less than the publish timestamp for message B, then a subscriber will always process message A before processing message B. Essentially, once the order is established (via publish timestamps), it will be consistent. This works if it is okay for the Cloud Pub/Sub service itself to establish the ordering of messages.
Publish timestamps are not synchronized across servers and so if it is necessary for the order to be established by the publishers, it will be necessary for the publishers to provide a timestamp (or sequence number) as an attribute that is used for ordering in the subscriber (and synchronized across publishers). The subscriber would sort message by this user-provided timestamp instead of by the publish timestamp. The oldest_unacked_message_age will no longer be exact because it is tied to the publish timestamp. One could be more conservative and only consider messages ordered that are older than oldest_unacked_message_age minus some delta to account for this discrepancy.
Google Cloud Pub-sub does not guarantee order of events receive to consumers as they were produced. Reason behind that is Google Cloud Pub-sub also running on a cluster of nodes. The possibility is there an event B can reach the consumer before event A. To Ensure ordering you have to make changes on both producer and consumer to identify the order of events. Here is section from docs.

Is Amazon SQS a good service for syncup between two components?

I have two components - one cloud based CLS app. and the other is normal Java based admin which talks to MySQL.
Considering SQS is not FIFO and I am not sure when will I receive the message at my consumer end. Also, I might receive a new message before the previous message on same data causing data inconsistency
If I want to syncup data between these two systems, is SQS a good service ?
Is SQS generally a good tool in such sync up scenarios?
SQS is "loosely-FIFO" and the SQS FAQ recommends adding sequencing information to each message to achieve ordering:
If your system requires the order of messages to be preserved, place
sequencing information in each message so that messages can be ordered
when they are received. Source
Messages that need to arrive in a specific order may not be a good candidate for standard SQS queue. However you can set a message sequence counter while sending message. At receiving end, you can keep processing messages if sequence is right. In case an out of sequence message comes, wait till the right message comes and then process right sequence message and others which came in between.
On Nov 17th, 2016 FIFO Queue have been introduced in certain regions (US East (Ohio) and US West (Oregon)) which complements the standard queue. The order in which messages are sent and received is strictly preserved and a message is delivered once and remains available until a consumer processes and deletes it; duplicates are not introduced into the queue.
FIFO queues use the same API actions as standard queues, and the mechanics for receiving and deleting messages and changing the visibility timeout are the same. However, when sending messages, you must specify a message group ID.
Amazon SQS has just gained FIFO Queues with Exactly-Once Processing & Deduplication:
Today we are making SQS even more powerful and flexible with support
for FIFO (first-in, first-out) queues. We are rolling out this new
type of queue in two regions now, and plan to make it available in
many others in early 2017.
These queues are designed to guarantee that messages are processed
exactly once, in the order that they are sent, and without duplicates.
[...]
[emphasis mine]
As emphasized, these new FIFO SQS queues provide more options to cover the use case at hand, but are not yet available in all SQS regions [initially only in US East (Ohio) and US West (Oregon)]. Also, the SQS FAQ for FIFO queues outlines notable differences between standard and FIFO queues that should be considered before deciding which queue type matches a particular use case.

Using Amazon SQS with multiple consumers

I have a service-based application that uses Amazon SQS with multiple queues and multiple consumers. I am doing this so that I can implement an event-based architecture and decouple all the services, where the different services react to changes in state of other systems. For example:
Registration Service:
Emits event 'registration-new' when a new user registers.
User Service:
Emits event 'user-updated' when user is updated.
Search Service:
Reads from queue 'registration-new' and indexes user in search.
Reads from queue 'user-updated' and updates user in search.
Metrics Service:
Reads from 'registration-new' queue and sends to Mixpanel.
Reads from queue 'user-updated' and sends to Mixpanel.
I'm having a number of issues:
A message can be received multiple times when doing polling. I can design a lot of the systems to be idempotent, but for some services (such as the metrics service) that would be much more difficult.
A message needs to be manually deleted from the queue in SQS. I have thought of implementing a "message-handling-service" that handles the deletion of messages when all the services have received them (each service would emit a 'message-acknowledged' event after handling a message).
I guess my question is this: what patterns should I use to ensure that I can have multiple consumers for a single queue in SQS, while ensuring that the messages also get delivered and deleted reliably. Thank you for your help.
I think you are doing it wrong.
It looks to me like you are using the same queue to do multiple different things. You are better of using a single queue for a single purpose.
Instead of putting an event into the 'registration-new' queue and then having two different services poll that queue, and BOTH needing to read that message and both doing something different with it (and then needing a 3rd process that is supposed to delete that message after the other 2 have processed it).
One queue should be used for one purpose.
Create a 'index-user-search' queue and a 'send to mixpanels' queue,
so the search service reads from the search queues, indexes the user
and immediately deletes the message.
The mixpanel-service reads from the mix-panels queue, processes the
message and deletes the message.
The registration service, instead of emiting a 'registration-new' to a single queue, now emits it to two queues.
To take it one step better, add SNS into the mix here and have the registration service emit an SNS message to the 'registration-new' topic (not queue), and then subscribe both of the queues I mentioned above, to that topic in a 'fan-out' pattern.
https://aws.amazon.com/blogs/aws/queues-and-notifications-now-best-friends/
Both queues will receive the message, but you only load it into SNS once - if down the road a 3rd unrelated service needs to also process 'registration-new' events, you create another queue and subscribe it to the topic as well - it can run with no dependencies or knowledge of what the other services are doing - that is the goal.
The primary use-case for multiple consumers of a queue is scaling-out.
The mechanism that allows for multiple consumers is the Visibility Timeout, which gives a consumer time to process and delete a message without it being consumed concurrently by another consumer.
To address the "At-Least-Once Delivery" property of Standard Queues,
the consuming service should be idempotent.
If that isn't possible, one possible solution is to use FIFO queues, but this mode has a limited message delivery rate and is not compatible with SNS subscription.
They even have a tutorial on how to create a fanout scenario using the combo SNS+SQS.
https://aws.amazon.com/getting-started/tutorials/send-fanout-event-notifications/
Too bad it does not support FIFO queues so you have to be careful to handle out of order messages.
It would be nice if they had a consistent hashing solution to have multiple competing consumers while respecting the message order.