Determine concurrency based on a key (such as an id) with RabbitMQ

We are building a web application using RabbitMQ and Spring's listener-containers, configured for concurrent consumption as follows:
<rabbit:listener-container connection-factory="connectionFactory" concurrency="10">
    <rabbit:listener ref="FooService" method="handleFoo" queue-names="fooQueue"/>
</rabbit:listener-container>

<rabbit:topic-exchange name="exchange">
    <rabbit:bindings>
        <rabbit:binding queue="fooQueue" pattern="foo.handle"/>
    </rabbit:bindings>
</rabbit:topic-exchange>
I want the listeners to process messages concurrently (e.g. 10 threads for this example), but I don't want them to concurrently process messages with the same data. For example, if I am sending ids of Foo objects, I only want different Foo objects to be processed concurrently; messages for the same Foo object should be processed sequentially.
I have gone over the exchange and queue types of RabbitMQ but could not figure out how to do this with any of them.
One way I can think of is to create multiple queues with different patterns such as foo.handle.1, foo.handle.2 and so forth, and then hash the id of the Foo objects to these patterns. But doing this for every type of queue we have, and managing all of it, can get out of hand quite easily.
Is there a mechanism to accomplish this with RabbitMQ?

Unlike JMS, RabbitMQ (or AMQP itself) has no concept of a message selector - you can't pull selective messages from a queue.
The only solution with RabbitMQ is a separate queue for each type and a single consumer on each queue.
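If you do end up with the partitioned-queue scheme from the question (foo.handle.1, foo.handle.2, ...), the publishing side stays small. A minimal sketch, assuming Spring AMQP's RabbitTemplate, ten queues foo.handle.0 through foo.handle.9 each bound to the exchange with its own pattern and consumed by a single-threaded listener (all names here are illustrative):

import org.springframework.amqp.rabbit.core.RabbitTemplate;

public class FooPublisher {

    private static final int NUM_PARTITIONS = 10; // one queue + one single-threaded consumer each

    private final RabbitTemplate rabbitTemplate;

    public FooPublisher(RabbitTemplate rabbitTemplate) {
        this.rabbitTemplate = rabbitTemplate;
    }

    public void publish(long fooId, Object payload) {
        // Hash the id to a stable partition so messages for the same Foo
        // always land on the same queue and are processed sequentially,
        // while different Foos spread across the 10 queues in parallel.
        int partition = (int) Math.floorMod(fooId, (long) NUM_PARTITIONS);
        rabbitTemplate.convertAndSend("exchange", "foo.handle." + partition, payload);
    }
}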

Related

Single SQS Queue vs Multiple SQS Queue while creating a Async Model

I have to develop a component where the APIs are async in nature. In order to build this async model, I am going to use AWS SQS queues for publishing messages; the client will read from the queue and send the response back into the queue. There are currently 10 APIs that I have to expose.
Currently, I can think of having a single request and a single response queue (which I will poll) for all the APIs, with each API's payload identified by some Operation field.
The other way is to use a separate queue for each API. The advantage that I can see for multiple queues is that each API can have different traffic and having multiple queues can help the client of the queues to scale effectively.
What can be other pros or cons for both the approaches?
Separate your use-case into 2 distinct problems:
Problem 1: APIs to Workers, one queue or multiple?
If your workers do different types of work, then having a single queue will require them to inspect and then discard messages they don't care about. If this is the case, you should have one queue per message type; that way, a worker can handle any message it receives from its queue.
If workers start ignoring messages, then other workers, who may be idle, may be left waiting a while for the messages they care about.
Problem 2: Using a return queue for the "results". If your clients will be polling for results, then at each poll your API will need to poll the queue. Again, it will be "searching" for the right response, discarding responses it doesn't care about and starving other clients.
Recommendation:
Use multiple queues, one per "worker type". Workers should be able to process any message they receive from their queue.
Then use something other than SQS to store the result. One option is to use S3 to store the result (a sketch follows the steps below):
When your API "creates" the task, create an object in S3 and put a reference to that S3 object on your SQS queue.
Your worker will do the work, then put the result where it was told to.
When your client polls your API for the result, your API will check S3 and return the status/results.
Instead of S3, other data stores could be used if appropriate: RDS, DynamoDB, etc.
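A rough sketch of that create-task step, assuming the AWS SDK for Java v2 (the bucket name, queue URL, and key layout are all illustrative):

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

public class TaskCreator {

    private final S3Client s3 = S3Client.create();
    private final SqsClient sqs = SqsClient.create();

    public void createTask(String taskId) {
        // 1. Create a placeholder object in S3 marking the task as pending.
        s3.putObject(PutObjectRequest.builder()
                        .bucket("task-results")        // illustrative bucket
                        .key("results/" + taskId)
                        .build(),
                RequestBody.fromString("{\"status\":\"PENDING\"}"));

        // 2. Put a reference to that S3 object on the work queue; the worker
        //    writes its result to the same key, and the API polls S3 for it.
        sqs.sendMessage(SendMessageRequest.builder()
                .queueUrl("https://sqs.us-east-1.amazonaws.com/123456789012/work-queue") // illustrative
                .messageBody("{\"taskId\":\"" + taskId + "\",\"resultKey\":\"results/" + taskId + "\"}")
                .build());
    }
}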

Clojure: future, agent or core.async for IO

Say that there are 4 components:
interactive data collection via http (survey)
atom, which accumulates survey stages
CPU-heavy computation
database writer
I'd like the last two operations to be asynchronous: place the accumulated data on a queue somewhere and start collecting another survey, while something processes the data and something else does the IO for any previously prepared data.
The question is: which language feature should I use? In the Clojure Programming book there are examples of using agents as components to perform IO operations, but don't futures offer the same "fire and forget" facilities? Or is an agent still an identity, like an atom or a ref, and not an actor at all?
An agent is a container for mutable state, just like atoms and refs. You send an action to an agent, which is "fire and forget," with the guarantee that each action sent to that agent will be executed only once (unlike functions passed to swap! on an atom or alter on a ref, which can potentially be retried multiple times). Multiple actions sent to the agent will be completed serially, not concurrently.
Thus, I/O can safely be performed within an action sent to an agent. So, an agent may be a good candidate for your step #4, as you want each database write to occur only once (never retried).
You would have a single agent, representing your database state. In each future/thread you create in step #3, after you perform your computation, you send the action to the lone agent, which will serialize the actions, even if they arrive at the same time from your threads.
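To make the concurrency shape concrete, here is a rough Java analog (purely illustrative, since the question is about Clojure): a single-threaded executor gives the database writes the same one-at-a-time, in-submission-order guarantee that actions sent to a lone agent give you, while a thread pool plays the role of the futures in step #3:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Pipeline {

    // Analog of the lone agent: one thread runs the database writes
    // serially, in submission order, each exactly once (never retried).
    private final ExecutorService dbWriter = Executors.newSingleThreadExecutor();

    // Analog of the futures in step #3: a pool for the CPU-heavy computation.
    private final ExecutorService computePool = Executors.newFixedThreadPool(4);

    public void submit(String surveyData) {
        computePool.submit(() -> {
            String result = surveyData.toUpperCase();  // stand-in for the heavy computation
            dbWriter.submit(() -> System.out.println("writing " + result)); // stand-in for the DB write
        });
    }
}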

Chat bots: ensuring serial processing of messages on a per-conversation basis in clustered environment

In the context of writing a Messenger chat bot in a cloud environment, I'm facing some concurrency issues.
Specifically, I would like to ensure that incoming messages from the same conversation are processed one after the other.
As a constraint, I'm processing the messages with workers in a cloud environment (i.e. the worker pool is of variable size, and worker instances are potentially short-lived and may crash). Also, low latency is important.
So abstracting a little, my requirements are:
I have a stream of incoming messages
each of these messages has a 'topic key' (the conversation id)
the set of topics is not known ahead-of-time and is virtually infinite
I want to ensure that messages of the same topic are processed serially
on a cluster of potentially ephemeral workers
if possible, I would like reliability guarantees, e.g. making sure that each message is processed exactly once.
My questions are:
Is there a name for this concurrency scenario?
Are there technologies (message brokers, coordination services, etc.) which implement this out of the box?
If not, what algorithms can I use to implement this on top of lower-level concurrency tools? (distributed locks, actors, queues, etc.)
I don't know of a widely-accepted name for the scenario, but a common strategy to solve that type of problem is to route your messages so that all messages with the same topic key end up at the same destination. A couple of technologies that will do this for you:
With Apache ActiveMQ, HornetQ, or Apache ActiveMQ Artemis, you could use your topic key as the JMSXGroupId to ensure all messages with the same topic key are processed in-order by the same consumer, with failover
With Apache Kafka, you could use your topic key as the partition key, which will also ensure all messages with the same topic key are processed in-order by the same consumer (a sketch follows below)
Some message broker vendors refer to this requirement as Message Grouping, Sticky Sessions, or Sticky Message Load Balancing.
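For the Kafka option, the routing is just the record key. A minimal producer sketch, assuming the Kafka Java client (the broker address and topic name are illustrative):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ConversationProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keying the record by conversation id makes Kafka hash every
            // message for that conversation to the same partition, and a
            // partition is consumed by exactly one consumer in a group,
            // so per-conversation ordering is preserved.
            producer.send(new ProducerRecord<>("chat-messages", "conversation-42", "hello"));
        }
    }
}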
Another common strategy on messaging systems with weaker delivery/ordering guarantees (like Amazon SQS) is to simply include a sequence number in the message and leave it up to the destination to resequence and request redelivery of missing messages as needed.
I think you can solve this by using a queue and a set. Send every message object to the queue and process messages first in, first out. When adding a message to the queue, also add its topic name to the set; when taking a message out for processing, remove its topic name from the set.
Then, if a topic is already in the set, don't add another message object with the same topic to the queue.
I hope this will help you. All the best :)

Does a new instance of an actor get created when there are too many messages?

I recently learned about Akka, but there are some ideas I can't grasp.
My question is: if there are too many messages in the queue, will a new actor be created?
In many frameworks, for example, when an HTTP request message comes in and the framework finds that the current "workers" are busy, it will create another "worker" to process the new message on another thread.
But it seems Akka doesn't work this way; there is only one actor instance.
So I think a "busy actor" will block its queue, which will hurt throughput and performance. Am I correct?
Each actor stores its messages in a mailbox.
http://doc.akka.io/docs/akka/current/scala/mailboxes.html
The default mailbox is unbounded and non-blocking. If your actor cannot process messages quickly enough, its mailbox balloons in size and consumes increasing amounts of RAM. You can configure Akka to use a bounded, blocking mailbox which will block the sender when over capacity.
If you would like to dynamically manage a pool of actors, look into Routing strategies.
http://doc.akka.io/docs/akka/2.4.1/scala/routing.html
You can create a Router Actor that receives messages and passes them to routee actors. The Router also manages the routee pool and can dynamically generate routees as needed.
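A rough sketch of such a router, assuming the classic Akka Java API (2.5+; the Worker class and pool size are illustrative):

import akka.actor.AbstractActor;
import akka.actor.ActorRef;
import akka.actor.ActorSystem;
import akka.actor.Props;
import akka.routing.RoundRobinPool;

public class RouterExample {

    // A hypothetical worker that just prints whatever it receives.
    static class Worker extends AbstractActor {
        @Override
        public Receive createReceive() {
            return receiveBuilder()
                    .matchAny(msg -> System.out.println(self().path() + " got " + msg))
                    .build();
        }
    }

    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("example");
        // The router keeps a pool of 5 routees and spreads messages across
        // them, so one slow message doesn't back up all the others.
        ActorRef router = system.actorOf(
                new RoundRobinPool(5).props(Props.create(Worker.class)), "workers");
        for (int i = 0; i < 20; i++) {
            router.tell("message-" + i, ActorRef.noSender());
        }
    }
}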
Also, if you use Futures and callback-based asynchronous execution, your actors will not block on HTTP requests.
TL;DR:
If you send messages faster than your Actor can process them, eventually your application will start dropping messages.
Longer answer:
As I understand it, every Akka actor has a queue (its mailbox) associated with it, which holds all the messages it receives.
If you send messages to this actor faster than the actor can process them, the queue will eventually overflow the available RAM, since queued messages are kept in memory.
It is not possible to spawn another actor on the fly, because messages on the queue are processed in order; that ordering would be broken if more than one consumer existed.
I would suggest you take a look at Akka Streams. This is a higher-level API built on top of actors that guards you against this kind of thing by providing backpressure throughout your system: if the actor you're sending messages to is slower than whoever is producing the messages, the consumer will ask the producer to slow down, rather than overflow your actor's queue.
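For a feel of the API, a tiny sketch assuming the pre-2.6 Akka Streams Java API (the slow stage is a stand-in):

import akka.actor.ActorSystem;
import akka.stream.ActorMaterializer;
import akka.stream.Materializer;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;

public class BackpressureExample {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("streams");
        Materializer mat = ActorMaterializer.create(system);

        // The source only emits when downstream signals demand; because the
        // (stand-in) slow stage limits that demand, the producer waits
        // instead of piling messages up in an unbounded queue.
        Source.range(1, 1000)
                .map(n -> {
                    Thread.sleep(10); // stand-in for slow processing
                    return n * 2;
                })
                .runWith(Sink.foreach(System.out::println), mat);
    }
}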

Using Amazon SQS with multiple consumers

I have a service-based application that uses Amazon SQS with multiple queues and multiple consumers. I am doing this so that I can implement an event-based architecture and decouple all the services, where the different services react to changes in state of other systems. For example:
Registration Service:
Emits event 'registration-new' when a new user registers.
User Service:
Emits event 'user-updated' when user is updated.
Search Service:
Reads from queue 'registration-new' and indexes user in search.
Reads from queue 'user-updated' and updates user in search.
Metrics Service:
Reads from 'registration-new' queue and sends to Mixpanel.
Reads from queue 'user-updated' and sends to Mixpanel.
I'm having a number of issues:
A message can be received multiple times when doing polling. I can design a lot of the systems to be idempotent, but for some services (such as the metrics service) that would be much more difficult.
A message needs to be manually deleted from the queue in SQS. I have thought of implementing a "message-handling-service" that handles the deletion of messages when all the services have received them (each service would emit a 'message-acknowledged' event after handling a message).
I guess my question is this: what patterns should I use to ensure that I can have multiple consumers for a single queue in SQS, while ensuring that the messages also get delivered and deleted reliably. Thank you for your help.
I think you are doing it wrong.
It looks to me like you are using the same queue to do multiple different things; you are better off using a single queue for a single purpose.
Instead of putting an event into the 'registration-new' queue and then having two different services poll that queue, with BOTH needing to read that message and each doing something different with it (and then needing a 3rd process that is supposed to delete that message after the other 2 have processed it), use one queue per purpose:
Create an 'index-user-search' queue and a 'send-to-mixpanel' queue. The search service reads from the search queue, indexes the user, and immediately deletes the message. The mixpanel service reads from the mixpanel queue, processes the message, and deletes it.
The registration service, instead of emitting a 'registration-new' event to a single queue, now emits it to two queues.
To take it one step further, add SNS into the mix: have the registration service publish an SNS message to the 'registration-new' topic (not queue), and then subscribe both of the queues I mentioned above to that topic in a 'fan-out' pattern.
https://aws.amazon.com/blogs/aws/queues-and-notifications-now-best-friends/
Both queues will receive the message, but you only publish it to SNS once. If down the road a 3rd unrelated service also needs to process 'registration-new' events, you create another queue and subscribe it to the topic as well; it can run with no dependencies on, or knowledge of, what the other services are doing. That is the goal.
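The publishing side of that fan-out collapses to a single SNS call; a sketch assuming the AWS SDK for Java v2 (the topic ARN is illustrative):

import software.amazon.awssdk.services.sns.SnsClient;
import software.amazon.awssdk.services.sns.model.PublishRequest;

public class RegistrationEvents {

    private final SnsClient sns = SnsClient.create();

    public void emitRegistrationNew(String userId) {
        // One publish to the topic; SNS copies the message into every SQS
        // queue subscribed to it (search, mixpanel, and any future consumer),
        // so the publisher needs no knowledge of who is listening.
        sns.publish(PublishRequest.builder()
                .topicArn("arn:aws:sns:us-east-1:123456789012:registration-new") // illustrative
                .message("{\"event\":\"registration-new\",\"userId\":\"" + userId + "\"}")
                .build());
    }
}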
The primary use-case for multiple consumers of a queue is scaling-out.
The mechanism that allows for multiple consumers is the Visibility Timeout, which gives a consumer time to process and delete a message without it being consumed concurrently by another consumer.
To address the "At-Least-Once Delivery" property of Standard Queues, the consuming service should be idempotent.
If that isn't possible, one option is to use FIFO queues, but that mode has a limited message delivery rate and is not compatible with SNS subscriptions.
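In code, the visibility-timeout contract is just receive, process, then delete; a sketch assuming the AWS SDK for Java v2 (the queue URL is illustrative):

import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.ReceiveMessageRequest;

public class QueueConsumer {

    private static final String QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/user-updated"; // illustrative

    public static void main(String[] args) {
        SqsClient sqs = SqsClient.create();
        while (true) {
            // While a received message is in flight (its visibility timeout),
            // no other consumer of this queue will be handed the same message.
            for (Message m : sqs.receiveMessage(ReceiveMessageRequest.builder()
                    .queueUrl(QUEUE_URL)
                    .waitTimeSeconds(20)   // long polling
                    .visibilityTimeout(30) // seconds available to process + delete
                    .build()).messages()) {
                process(m.body());         // must be idempotent: at-least-once delivery
                sqs.deleteMessage(DeleteMessageRequest.builder()
                        .queueUrl(QUEUE_URL)
                        .receiptHandle(m.receiptHandle())
                        .build());
            }
        }
    }

    private static void process(String body) {
        System.out.println("processing " + body);
    }
}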
They even have a tutorial on how to create a fanout scenario using the combo SNS+SQS.
https://aws.amazon.com/getting-started/tutorials/send-fanout-event-notifications/
Too bad it does not support FIFO queues, so you have to be careful to handle out-of-order messages.
It would be nice if they had a consistent-hashing solution to allow multiple competing consumers while respecting message order.