Clojure Recipe for Dynamic Channels/pub-sub

I'm looking for a way to dynamically build up subscribers to publications using core.async (or anything that will work).
Problem: I have messages that I need to process based on the :sender of the message. Each message will be operated on by the same function, but the point is that each sender's messages will be processed in order, one at a time -- multiple topics based on the :sender key with one consumer each. I also need a way to limit the number of active consumers across all subscriptions to keep resource utilization down.
My thought is that I would have a pub channel:
(require '[clojure.core.async :refer [chan pub]])
(def in-chan (chan))
(def publication (pub in-chan :sender))
But I want to be able to ensure that there is always a subscriber as new senders are brought online. I'm open to better facilities as long as the code stays small and simple.
Question: Is there an idiomatic way of ensuring there is a subscriber for a specific publication before sending the message? How do I coordinate all the consumers of each subscription to use a shared thread pool?
EDIT: I’ve figured out how to coordinate the work using a thread pool and a single consumer per topic. For checking whether a sub exists, I think I will use a ref holding a map from topic name to sub. If the ref doesn’t have an entry for a topic, I’d create a subscriber, add it to the map, register it with the publication, and then publish the message. The purpose of this question is to see if there’s a better way to spin up and keep track of subscribers for dynamically created topics.

The solution I came up with is to use a ref:
(def registration (ref {}))
This registration is used by the thread that writes to the publication channel just before writing:
(defn register
  "Returns something-to-track if topic-name was newly registered, nil otherwise."
  [registration topic-name]
  (dosync
    (let [r (ensure registration)        ;; stable read of the ref within the transaction
          something-to-track :tracked]   ;; in my case, I'm keeping track of a channel
      (when-not (get r topic-name)
        (alter registration assoc topic-name something-to-track)
        something-to-track))))
Whenever we need to publish a message, we can use this function to "register" a new subscriber. If one did not previously exist, it returns something-to-track; in my case that is a channel I subsequently call sub on. If it returns nil, I can ignore it. As for missing messages in a concurrent environment: the transaction does protect the check-then-register sequence, since ensure prevents another transaction from committing a change to registration before this one commits (the transaction retries instead). In any case, my pipeline is small enough that I write to the pub from a single thread.
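For completeness, here is a rough sketch of how the pieces could fit together, assuming register is adapted so that something-to-track is bound to a fresh channel instead of :tracked, and assuming a hypothetical per-message handler called process:

(require '[clojure.core.async :refer [sub go-loop <! >!!]])

(defn publish! [msg]
  (let [topic (:sender msg)]
    ;; register returns the new channel only the first time we see this sender
    (when-let [topic-chan (register registration topic)]
      (sub publication topic topic-chan)
      ;; one consumer per topic; all go blocks share core.async's fixed thread pool
      (go-loop []
        (when-some [m (<! topic-chan)]
          (process m) ;; hypothetical handler
          (recur))))
    (>!! in-chan msg)))

The shared go-block pool caps consumer threads for free, but it is small, so if process blocks for long you would want to drain each topic channel with async/thread (or your own executor) and bound the concurrency yourself.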

Related

How to do queries on a collection of actors

I have an actor system that at the moment accepts commands/messages. The state of these actors is persisted using Akka.Persistence. We now want to build the query side for this actor system. Basically, our problem is that we want a way to get an aggregate/list of the states of all of these particular actors. While I'm not strictly subscribing to the CQRS pattern, I think it might be a neat way to go about it.
My initial thought was to have a querying actor that holds, as part of its state, an aggregation of the states of the other actors doing the "data writes". To do this, the query actor would subscribe to the actors it's interested in, and those actors would send the query actor their states whenever they undergo some sort of state change. Is this the way to go about it? Is there a better way to do this?
My recommendation for implementing this type of pattern is to use a combination of pub-sub and push-and-pull messaging for your actors here.
For each "aggregate," this actor should be able to subscribe to events from the individual child actors you want to query. Whenever a child's state changes, a message is pushed into all subscribed aggregates and each aggregate's state is updated automatically.
When a new aggegrate comes online and needs to retrieve state it missed (from before it existed) it should be able to pull the current state from each child and use that to build its current state, using incremental updates from children going forward to keep its aggregated view of the children's state consistent.
This is the pattern I use for this sort of work and it works well locally out of the box. Over the network, you may have to ensure deliverability guarantees and that's generally easy to do. You can read a bit more on how to do that there: https://petabridge.com/blog/akkadotnet-at-least-once-message-delivery/
Some Akka.Persistence backends (e.g. those working with SQL) also implement something known as Akka.Persistence.Query. It allows you to subscribe to a stream of the events being produced and use it as a source with Akka.Streams semantics.
If you're using SQL journals you'll need the Akka.Persistence.Query.Sql and Akka.Streams packages. From there you can create a live (that is, continuously updated) source of events for a particular actor and use it for any operation you like, e.g. printing them:
using System;
using Akka.Actor;
using Akka.Persistence.Query;
using Akka.Persistence.Query.Sql;
using Akka.Streams;

using (var system = ActorSystem.Create("system"))
using (var materializer = system.Materializer())
{
    // ReadJournalFor is an extension method from Akka.Persistence.Query
    var queries = system.ReadJournalFor<SqlReadJournal>(SqlReadJournal.Identifier);
    queries.EventsByPersistenceId("<persistence-id>", 0L, long.MaxValue)
        .Select(envelope => envelope.Event)
        .RunForeach(e => Console.WriteLine(e), materializer);
}

Cache a common resource using Akka

Hey guys I want to do the following:
Say I have some n actors which are all reading from some common variable called x.
In the background I want to schedule an actor which will keep updating this variable x, say, every 5-10 minutes.
I don't ever want the n actors to wait for this value to be updated; they should get some value even while x is being updated.
So how can I handle this situation in the best possible way?
Irrespective of the actor model, there are two general approaches to this: push (the caching agent sends update notifications to clients, which update their local caches) or pull (clients hit the caching agent every time).
In either case there is a "current" cache version that should be immutable (to prevent concurrency issues). In the push model clients maintain it locally; in the pull model it is maintained by the caching agent. From here, you have many design choices, driven by your application's needs, that lead to different trade-offs.
Roughly, if you want to keep clients simple, use the pull model. You buy this simplicity at the cost of losing control over the freshness of your cache and giving up update notifications. It also makes the communication process more complicated.
If you want to stay current with the actual data and know when the cache is updated (and potentially control the update process), use the push model. I'd go with that in your case, because it's very simple to implement with actors. A possible implementation in pseudo-Scala:
import akka.actor.{Actor, ActorRef, Terminated}
import scala.collection.mutable

case class CacheUpdate(newValue: String)
case class AddWorker(actor: ActorRef)
case class Update(newValue: String)

class Worker extends Actor {
  var cache: String = ""
  def receive = {
    case CacheUpdate(newValue) => cache = newValue
  }
}

class Publisher extends Actor {
  val workers = new mutable.ListBuffer[ActorRef]()
  def receive = {
    case AddWorker(actor) =>
      workers += actor
      context.watch(actor) // this is important to keep the workers list current
    case Terminated(actor) => workers -= actor
    case Update(newValue) => workers.foreach(_ ! CacheUpdate(newValue))
  }
}
You can either send the AddWorker message as part of the actor lifecycle (in which case you need to pass the Publisher in a constructor), or you can coordinate it externally.
It's considered a bad practice to share mutable objects among different actors, and the way you explain it, your variable 'x' is mutable and it's shared.
The proper way to share information among actors is via immutable messages.
One of the possible solutions would be:
having an actor that creates your 'n' actors
this same actor schedules a message to self
when this message is processed, the variable is updated
after this, this actor sends a message to its children (the 'n' actors) with a copy (never share something mutable) of the value of variable 'x'
each of your 'n' actors will receive the new value as a message, and they can do whatever is expected of them with it.
You can also read this article; it contains a detailed example of caching via ConsistentHashable.

Why does the CPU profile in VisualVM show the process spending all its time in a promise deref when using Clojure's core.async to read a Kafka stream?

I am running a Clojure app reading from a Kafka stream. I am using the shovel GitHub project https://github.com/l1x/shovel to read from the stream. When I profiled my application using VisualVM looking for hotspots, I noticed that most of the CPU time, about 70%, is being spent in the function clojure.core$promise$reify__6310.deref.
The shovel consumer is a thin wrapper over the Kafka consumer group API. It reads from a Kafka topic and publishes out to a core.async channel. Should I be concerned that my application latencies would be affected if I continued using this API? Is there any explanation why the deref on the promise is taking this much CPU time?
In Clojure, $ is used in the printed representation of a class to represent an inner class. clojure.core$promise$reify__6310.deref means calling the method deref on a class that is created via reify as an inner class of clojure.core/promise. As it turns out, if you look at the class of a promise, it will show up as an inner reified class inside clojure.core$promise.
A promise in Clojure represents data that may not yet be available. You can see its behavior in a repl:
user> (def p (promise))
#'user/p
user> (class p)
clojure.core$promise$reify__6363
user> (deref p)
This call will hang, giving no result and no new REPL prompt, until you deliver a value to the promise from another REPL connection or interrupt the deref call. The fact that time is being spent in deref of a promise simply means that the program logic is waiting on values that are not yet computed (or have not yet come in via the network, etc.).
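For illustration, here is a minimal REPL sketch (names arbitrary) of both the blocking behavior and the timeout arity of deref:

(def p (promise))
(future
  (Thread/sleep 1000)
  (deliver p :done)) ;; delivered from another thread
(deref p)            ;; parks the caller ~1 s, then returns :done

;; deref also takes a timeout, so a caller never parks forever:
(deref (promise) 100 :timed-out) ;; => :timed-out after 100 ms

Time attributed to deref in a profile generally means threads parked waiting for a deliver, not threads burning CPU.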

How do you process messages in parallel while ensuring FIFO per entity?

Let's say you have an entity, say, "Person" in your system and you want to process events that modify various Person entities. It is important that:
Events for the same Person are processed in FIFO order
Multiple Person event streams be processed in parallel by different threads/processes
We have an implementation that solves this using a shared database and locks. Threads compete to acquire the lock for a Person and then process events in order after acquiring the lock. We'd like to move to a message queue to avoid polling and locking, which we feel would reduce load on the DB and simplify the implementation of the consumer code.
I've done some research into ActiveMQ, RabbitMQ, and HornetQ but I don't see an obvious way to implement this.
ActiveMQ supports consumer subscription wildcards, but I don't see a way to limit the concurrency on each queue to 1. If I could do that, then the solution would be straightforward:
Somehow tell the broker to allow a concurrency of 1 for all queues starting with: /queue/person.
Publisher writes event to queue using Person ID in the queue name. e.g.: /queue/person.20
Consumers subscribe to the queue using wildcards: /queue/person.>
Each consumer would receive messages for different person queues. If all person queues were in use, some consumers may sit idle, which is ok
After processing a message, the consumer sends an ACK, which tells the broker it's done with the message, and allows another message for that Person queue to be sent to another consumer (possibly the same one)
ActiveMQ came close: You can do wildcard subscriptions and enable "exclusive consumer", but that combination results in a single consumer receiving all messages sent to all matching queues, reducing your concurrency to 1 across all Persons. I feel like I'm missing something obvious.
Questions:
Is there a way to implement the above approach with any major message queue implementation? We are fairly open to options. The only requirement is that it run on Linux.
Is there a different way to solve the general problem that I'm not considering?
Thanks!
It looks like JMSXGroupID is what I'm looking for. From the ActiveMQ docs:
http://activemq.apache.org/message-groups.html
Their example use case with stock prices is exactly what I'm after. My only concern is what happens if the single consumer dies. Hopefully the broker will detect that and pick another consumer to associate with that group id.
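For illustration, tagging messages with JMSXGroupID through the plain JMS API might look like this from Clojure (session and producer setup omitted; names illustrative):

(defn send-person-event!
  "Publish payload so the broker delivers all events for one person
  to a single consumer at a time (ActiveMQ message groups)."
  [session producer person-id payload]
  (let [msg (.createTextMessage session payload)]
    ;; JMSXGroupID is the header ActiveMQ uses to assign message groups
    (.setStringProperty msg "JMSXGroupID" (str "person-" person-id))
    (.send producer msg)))

On the failure concern: ActiveMQ reassigns a message group to another consumer when the group's current owner closes, so the group is not orphaned and ordering resumes with the new owner.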
One general way to solve this problem (if I've understood it correctly) is to introduce some unique property of a Person (say, the database-level id of the Person) and use the hash of that property as the index of the FIFO queue to put that Person in.
Since the hash can be unwieldy large (you can't afford 2^32 queues/threads), use only the N least significant bits of that hash.
Each FIFO queue should have a dedicated worker operating on it -- voila, your requirements are satisfied!
This approach has one drawback: your Persons must have well-distributed ids to put more-or-less equal load on all queues. If you can't guarantee that, consider using a round-robin set of queues and tracking which Persons are currently being processed, to ensure sequential processing for the same Person.
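In Clojure core.async terms (the language of the first question above), the idea might look like this sketch; the queue count, buffer size, and process-event handler are all illustrative:

(require '[clojure.core.async :refer [chan go-loop <! >!!]])

(def n-queues 16) ;; a power of two, so masking the hash picks a queue
(def queues (vec (repeatedly n-queues #(chan 100))))

;; one dedicated worker per queue preserves FIFO order per Person
(doseq [q queues]
  (go-loop []
    (when-some [event (<! q)]
      (process-event event) ;; hypothetical handler
      (recur))))

(defn dispatch! [event]
  ;; the same :person-id always hashes to the same queue
  (>!! (queues (bit-and (hash (:person-id event)) (dec n-queues))) event))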
If you already have a system that allows shared locks, why not have a lock for every queue, which consumers must acquire before they read from the queue?

Is Clojure's "send" asynchronous?

I'm writing a simple networking framework for Clojure using Java's new I/O package. It manages a pool of "selector agents", each of which holds a Selector.
I defined a dispatch action for the selector agent. This action blocks on a call to selector.select(). When that returns, the selector agent iterates over the selected keys and performs I/O. When the I/O is completed, the selector agent sends itself the dispatch action using send-off, effectively looping on calls to selector.select().
When I want to add a new channel or change a channel's interest ops, I send the selector agent the appropriate action and then unblock the selector (it's blocked on select(), remember?). This ensures that (send-off selector-agent dispatch) in the selector agent is executed after (send selector-agent add-channel channel).
I thought this would be bullet-proof, since the call to send-off is performed before the selector wakes up and thus before the selector agent sends itself the dispatch action. However, this yields inconsistent behavior: sometimes the dispatch action occurs first and sometimes it doesn't.
My understanding is that agents are not guaranteed to execute actions in the exact order they were sent when those actions come from multiple threads (i.e. send and send-off are not synchronous as far as queuing actions is concerned).
Is this correct?
send and send-off guarantee that actions will be placed on the Agent's queue in the order they are sent, within a single thread. Updating the Agent's queue happens synchronously.
I expect you have a simple race condition, although I can't identify it from the description.
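As a quick illustration of the per-thread guarantee (a toy example):

(def a (agent []))
(doseq [i (range 5)]
  (send a conj i)) ;; all sends come from this one thread
(await a)
@a ;; => [0 1 2 3 4] -- always in send order

Across threads, however, the interleaving of sends is up to the scheduler, which matches the inconsistency you're seeing.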
You are absolutely right. Actions coming from the same thread will be executed in the same order as they were submitted. But you cannot make any assumptions about the execution order of actions that come from different threads.
send and send-off are built for asynchronous state changes.
If you need synchronous updates, then atoms are likely your best tool.
Since you need to preserve the order of requests, you may have to use another data structure inside a concurrency object (an atom) that can be synchronously updated. It may work to put a persistent queue inside an atom and have all your producer threads synchronously add to that queue while your consumers synchronously pull entries from it, as sketched below.
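A rough sketch of that queue-in-an-atom idea (all names illustrative):

(def work-queue (atom clojure.lang.PersistentQueue/EMPTY))

(defn enqueue! [job]
  (swap! work-queue conj job)) ;; swap! serializes the enqueues

(defn dequeue!
  "Atomically pop and return the head job, or nil if the queue is empty."
  []
  (loop []
    (let [q @work-queue
          job (peek q)]
      (cond
        (nil? job) nil
        ;; only succeed if no one else popped in the meantime
        (compare-and-set! work-queue q (pop q)) job
        :else (recur)))))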
here is the super-brief decision chart:
more than one and synchronous: use a ref
asynchronous and one: use an agent
asynchronous and more than one: send to agents from within a dosync
synchronous and only one: use an atom.