How to properly batch messages with core.async? - clojure

I would like to batch messages on a core.async chan by count and timeout, (i.e. 10ms or 10 messages, whichever comes first). Tim Baldridge has a video on batching, but it uses deprecated functions in core.async and does not use transducers. I'm looking for something like the following...
(defn batch [in out max-time max-count]
...
)

Transducers shouldn't really be a concern for a batching function – as a taker on the in channel, it will see values transformed by any transducers on that channel, and any takers listening on out will in turn see values transformed by that channel's transducer.
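To make that concrete, here is a minimal sketch showing that a taker only ever sees values after the channel's transducer has run. The function name transformed-values is made up for illustration, and (map inc) is just a stand-in transducer.

```clojure
(require '[clojure.core.async :as async])

(defn transformed-values
  "Puts xs on a channel whose transducer is (map inc), then reads
  everything back. Hypothetical helper, for illustration only."
  [xs]
  ;; A channel with a transducer must have a buffer, hence (max 1 ...).
  (let [ch (async/chan (max 1 (count xs)) (map inc))]
    (doseq [x xs]
      (async/>!! ch x))          ; the transducer runs as values are put
    (async/close! ch)
    (async/<!! (async/into [] ch))))

(transformed-values [1 2 3]) ;; => [2 3 4]
```

A batching function reading from such a channel would therefore batch 2, 3 and 4, without needing to know the transducer exists.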
As for an implementation, the function below takes batches of up to max-count items from in, or however many (possibly none) have arrived within max-time of the last batch being output, and puts each batch on out. The loop stops when the input channel closes, after flushing any partial batch; note that it does not close out. Values are subject to the input channel's transducer, if any, and takers on out see that channel's transducer applied, as noted above:
(defn batch [in out max-time max-count]
  (let [lim-1 (dec max-count)]
    (async/go-loop [buf [] t (async/timeout max-time)]
      (let [[v p] (async/alts! [in t])]
        (cond
          (= p t)
          (do
            (async/>! out buf)
            (recur [] (async/timeout max-time)))
          (nil? v)
          (if (seq buf)
            (async/>! out buf))
          (== (count buf) lim-1)
          (do
            (async/>! out (conj buf v))
            (recur [] (async/timeout max-time)))
          :else
          (recur (conj buf v) t))))))
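If you would rather have out close when in closes, and skip empty batches when the timer fires with nothing buffered, a variant along these lines might work. This is only a sketch: batch-and-close and batch-demo are names I made up, and the demo assumes the 1000 ms timer never fires while five values are being fed in.

```clojure
(require '[clojure.core.async :as async])

(defn batch-and-close
  "Variant of batch: never emits an empty batch, and closes out once
  in is closed. Hypothetical name, not part of core.async."
  [in out max-time max-count]
  (async/go-loop [buf [] t (async/timeout max-time)]
    (let [[v p] (async/alts! [in t])]
      (cond
        ;; Timer fired: flush whatever we have (if anything), restart timer.
        (= p t)
        (do (when (seq buf) (async/>! out buf))
            (recur [] (async/timeout max-time)))

        ;; Input closed: flush the partial batch, then close the output.
        (nil? v)
        (do (when (seq buf) (async/>! out buf))
            (async/close! out))

        ;; This value completes a batch.
        (= (inc (count buf)) max-count)
        (do (async/>! out (conj buf v))
            (recur [] (async/timeout max-time)))

        :else
        (recur (conj buf v) t)))))

(defn batch-demo
  "Feeds 1..5 through batch-and-close with max-count 2."
  []
  (let [in  (async/chan)
        out (async/chan 10)]
    (batch-and-close in out 1000 2)
    (async/go
      (doseq [x [1 2 3 4 5]]
        (async/>! in x))
      (async/close! in))
    ;; into only returns once out is closed, which the variant guarantees.
    (async/<!! (async/into [] out))))

(batch-demo) ;; => [[1 2] [3 4] [5]]
```

Closing out makes the function composable with async/into and friends, at the cost of taking ownership of a channel the caller passed in; whether that is acceptable depends on who else writes to out.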

Related

How can I return a vector?

I have a channel into which I am putting values inside a doseq loop.
This code reads from a list of ISBNs and, for each ISBN, does an Amazon search to return the contents of a book, and then calls another function to get the title and rank.
(def book_channel (chan 10))
Make sure you use clojure.core.async/into rather than clojure.core/into. Here is an example of a round trip from collection to channel and back to collection:
user> (require '[clojure.core.async :as async :refer [<! <!! >!! >! chan go]])
nil
user> (def book-chan (async/to-chan [:book1 :book2 :book3]))
#'user/book-chan
user> (<!! (clojure.core.async/into [] book-chan))
[:book1 :book2 :book3]
clojure.core.async/into returns a channel that will have exactly one item written to it. That one item will be written once its input channel closes. This keeps the whole thing asynchronous, but it does require that the code putting things into the book channel close it to signal that all the books are there.
You need to do some type of coordination to determine when all of your work is finished. You can pull that coordination out into the main thread fairly easily:
(def book_channel (chan 10))
(defn concurrency_test
  [list_of_isbns]
  (doseq [isbn list_of_isbns]
    (go (>! book_channel
            (get_title_and_rank_for_one_isbn
              (amazon_search isbn)))))
  (prn (loop [results []]
         (if (= (count results) (count list_of_isbns))
           results
           (recur (conj results (<!! book_channel)))))))
Here, I used a loop that keeps waiting for results and adding them to the vector until we have as many results as we do isbns. You'll want to make sure that get_title_and_rank_for_one_isbn always generates a result that can be put on a channel, otherwise the loop will wait forever.
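The same "wait until we have n results" idea can also be written with async/take and async/into, which avoids the hand-written loop. A rough sketch, where collect-n and squares-demo are hypothetical names and squaring stands in for the Amazon lookup:

```clojure
(require '[clojure.core.async :as async])

(defn collect-n
  "Blocks until n values have been read from ch (or ch closes) and
  returns them as a vector. Hypothetical helper."
  [ch n]
  ;; async/take closes its output after n items, which lets into finish
  ;; even though ch itself is never closed.
  (async/<!! (async/into [] (async/take n ch))))

(defn squares-demo
  "One go block per input, results gathered with collect-n."
  []
  (let [ch     (async/chan 10)
        inputs [1 2 3 4]]
    (doseq [x inputs]
      (async/go (async/>! ch (* x x))))   ; stand-in for the real lookup
    ;; Arrival order is not deterministic, so sort for display.
    (sort (collect-n ch (count inputs)))))

(squares-demo) ;; => (1 4 9 16)
```

The same caveat applies as with the loop: if one of the producers never puts a value, collect-n waits forever.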
You should close! the book_channel after you finish pushing stuff into it. Per async/into documentation - "ch must close before into produces a result."
(let [book> (chan)]
  (go
    (doseq [e (range 8)]
      (>! book> e))
    (close! book>))
  (<!! (async/into [] book>)))
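Alternatively, you can use async/onto-chan, which closes the channel for you once the collection is exhausted (in core.async 1.2 and later it is deprecated in favor of async/onto-chan! and async/onto-chan!!):
(let [book> (chan)]
  (async/onto-chan book> (range 8))
  (<!! (async/into [] book>)))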

What's the best way to alts!! on a vector of channels multiple times?

I'm using core.async to do some work in parallel, and then using alts!! to wait for a certain number of results with a timeout.
(ns c
  (:require [clojure.core.async :as a]))
(defn async-call-on-vector [v]
  (mapv (fn [n]
          (a/go (a/<! (a/timeout n)) ; simulate long time work
                n))
        v))
(defn wait-result-with-timeout [chans num-to-get timeout]
  (let [chans-count (count chans)
        num-to-get (min num-to-get
                        chans-count)]
    (if (empty? chans)
      []
      (let [timeout (a/timeout timeout)]
        (loop [result []
               met 0]
          (if (or (= (count result) num-to-get)
                  (= met chans-count)) ; all chan has been consumed
            result
            (let [[v c] (a/alts!! (conj chans timeout))]
              (if (= c timeout)
                result
                (case v
                  nil (do (println "got nil") (recur result met)) ; close! on that channel
                  (recur (conj result v) (inc met)))))))))))
and then invoke like:
user=> (-> [1 200 300 400 500] c/async-call-on-vector (c/wait-result-with-timeout 2 30))
This expression prints "got nil" many times. It seems that the channel returned by a go block is closed after the result has been delivered, which makes alts!! return nil in this case. This is very CPU-unfriendly, since it amounts to busy waiting. Is there a way to avoid this?
I solved this by defining a macro like go that returns a channel which is not closed after the result is delivered. Is this the right way to solve it?
"I'm using core.async to do something in parallel, and then using alts!! to wait on a certain number of results with a timeout."
It looks like you want to collect all of the values that will be delivered by some channels, until all of those channels are closed, or until a timeout occurs. One way to do that is to merge those channels onto a single channel, and then use alts! within a go-loop to collect the values into a vector:
(defn wait-result-with-timeout [chans timeout]
  (let [all-chans (a/merge chans)
        t-out (a/timeout timeout)]
    (a/go-loop [vs []]
      (let [[v _] (a/alts! [all-chans t-out])]
        ;; v will be nil if either every channel in
        ;; `chans` is closed, or if `t-out` fires.
        (if (nil? v)
          vs
          (recur (conj vs v)))))))
"It seems the channel returned by a go block is closed after the result has been returned."
You are correct, that is the documented behavior of a go block.
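You can see this at the REPL with a tiny experiment. The function name take-twice is made up for the example:

```clojure
(require '[clojure.core.async :as async])

(defn take-twice
  "Takes from the channel returned by a go block two times."
  []
  (let [c (async/go 42)]
    ;; First take yields the body's value; the channel is then closed,
    ;; so every later take returns nil immediately.
    [(async/<!! c) (async/<!! c)]))

(take-twice) ;; => [42 nil]
```

That immediate nil on every subsequent take is exactly what makes the original alts!! loop spin.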
"I solved this by defining a macro like go that returns a channel which is not closed after the result is returned. Is this the right way to solve it?"
Probably not, although it's not for me to say whether it's right or wrong for your particular use case. Generally speaking, a channel should close once it is done delivering values, since closing is how it signals that nothing more is coming. For example, the code above relies on all-chans closing to indicate that there is no more work to wait on.

clojure.async: "<! not in (go ...) block" error

When I evaluate the following core.async clojurescript code I get an error: "Uncaught Error: <! used not in (go ...) block"
(let [chans [(chan)]]
  (go
    (doall (for [c chans]
             (let [x (<! c)]
               x)))))
What am I doing wrong here? It definitely looks like the <! is in the go block.
Because go blocks can't cross function boundaries (and the body of for is compiled into a separate function), I tend to fall back on loop/recur for a lot of these cases. The (go (loop ...)) pattern is so common that core.async has a shorthand for it, go-loop, which is useful in cases like this:
user> (require '[clojure.core.async :as async])
user> (async/<!! (let [chans [(async/chan) (async/chan) (async/chan)]]
                   (doseq [c chans]
                     (async/go (async/>! c 42)))
                   (async/go-loop [[f & r] chans result []]
                     (if f
                       (recur r (conj result (async/<! f)))
                       result))))
[42 42 42]
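If all you need is one value from each channel, async/map with vector as the combining function may be an even shorter route, since it takes the first item from every source channel and applies the function to them. A sketch, with first-from-each and map-demo as made-up names:

```clojure
(require '[clojure.core.async :as async])

(defn first-from-each
  "Blocks until every channel in chans has delivered one value, then
  returns those values as a vector, in channel order."
  [chans]
  (async/<!! (async/map vector chans)))

(defn map-demo []
  (let [chans (vec (repeatedly 3 async/chan))]
    ;; One producer per channel, each delivering a distinct value.
    (doseq [[c v] (map vector chans [:a :b :c])]
      (async/go (async/>! c v)))
    (first-from-each chans)))

(map-demo) ;; => [:a :b :c]
```

No <! appears inside a nested function here, so the "used not in (go ...) block" error cannot arise.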
Why don't you use alts! from core.async?
This function lets you listen on multiple channels and tells you which channel each value was read from.
For example:
(let [chans [(chan)]]
  (go
    (let [[data ch] (alts! chans)]
      data)))
You can also check which channel the value came from:
...
(let [slow-chan (chan)
      fast-chan (chan)
      [data ch] (alts! [slow-chan fast-chan])]
  (when (= ch slow-chan)
    ...))
From the Docs:
Completes at most one of several channel operations. Must be called
inside a (go ...) block. ports is a vector of channel endpoints,
which can be either a channel to take from or a vector of
[channel-to-put-to val-to-put], in any combination. Takes will be
made as if by <!, and puts will be made as if by >!. Unless
the :priority option is true, if more than one port operation is
ready a non-deterministic choice will be made. If no operation is
ready and a :default value is supplied, [default-val :default] will
be returned, otherwise alts! will park until the first operation to
become ready completes. Returns [val port] of the completed
operation, where val is the value taken for takes, and a
boolean (true unless already closed, as per put!) for puts.
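A small experiment covering the less obvious parts of that docstring, namely :default and put ports. It uses alts!!, the blocking counterpart of alts! for use outside go blocks; alts-demo and the keywords in it are made up for the example.

```clojure
(require '[clojure.core.async :as async])

(defn alts-demo
  "Exercises :default, a put port and a take port on one buffered channel."
  []
  (let [c (async/chan 1)
        ;; Nothing is ready on c, so the default is returned right away.
        no-value    (async/alts!! [c] :default :nothing-ready)
        ;; A [channel value] pair is a put; the buffer has room, so it succeeds.
        [ok? port]  (async/alts!! [[c :hello]])
        ;; A bare channel is a take; the value we just put is waiting.
        [v _]       (async/alts!! [c])]
    {:no-value no-value
     :put-ok? ok?
     :put-port-is-c? (= port c)
     :taken v}))

(alts-demo)
;; => {:no-value [:nothing-ready :default], :put-ok? true,
;;     :put-port-is-c? true, :taken :hello}
```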

Agent/actor like constructs in clojure that operate on all messages received since last update

What's the best way in Clojure to implement something like an actor or agent (an asynchronously updated, uncoordinated reference) that does the following?
gets sent messages/data
executes some function on that data to obtain new state; something like (fn [state new-msgs] ...)
continues to receive messages/data during that update
once done with that update, runs the same update function against all messages that have been sent in the interim
An agent doesn't seem quite right here. One must simultaneously send function and data to agents, which doesn't leave room for a function which operates on all data that has come in during the last update. The goal implicitly requires a decoupling of function and data.
The actor model seems generally better suited in that there is a decoupling of function and data. However, all actor frameworks I'm aware of seem to assume each message sent will be processed separately. It's not clear how one would turn this on its head without adding extra machinery. I know Pulsar's actors accept a :lifecycle-handle function which can be used to make actors do "special tricks", but there isn't a lot of documentation around this, so it's unclear whether the functionality would be helpful.
I do have a solution to this problem using agents, core.async channels, and watch functions, but it's a bit messy, and I'm hoping there is a better solution. I'll post it as a solution in case others find it helpful, but I'd like to see what others come up with.
Here's the solution I came up with using agents, core.async channels, and watch functions. Again, it's a bit messy, but it does what I need it to for now. Here it is, in broad strokes:
(require '[clojure.core.async :as async :refer [>!! <!! >! <! chan go]])
; We'll call this thing a queued-agent
(defprotocol IQueuedAgent
  (enqueue [this message])
  (ping [this]))
(defrecord QueuedAgent [agent queue]
  IQueuedAgent
  (enqueue [_ message]
    (go (>! queue message)))
  (ping [_]
    (send agent identity)))
; Need a function for draining a core async channel of all messages
(defn drain! [c]
  (let [cc (chan 1)]
    (go (>! cc ::queue-empty))
    (letfn
      ; This fn does all the hard work, but closes over cc to avoid reconstruction
      [(drainer! [c]
         (let [[v _] (<!! (go (async/alts! [c cc] :priority true)))]
           (if (= v ::queue-empty)
             (lazy-seq [])
             (lazy-seq (cons v (drainer! c))))))]
      (drainer! c))))
; Constructor function
; Note: keys in an :or map must be symbols, so the default is {buffer 100}, not {:buffer 100}
(defn queued-agent [& {:keys [buffer update-fn init-fn error-handler-builder] :or {buffer 100}}]
  (let [q (chan buffer)
        a (agent (if init-fn (init-fn) {}))
        error-handler-fn (error-handler-builder q a)]
    ; Set up the queue, and watcher which runs the update function when there is new data
    (add-watch
      a
      :update-conv
      (fn [k r o n]
        (let [queued (drain! q)]
          (when-not (empty? queued)
            (send a update-fn queued error-handler-fn)))))
    (QueuedAgent. a q)))
; Now we can use these like this
(def a (queued-agent
         :init-fn (fn [] {:some "initial value"})
         :update-fn (fn [a queued-data error-handler-fn]
                      (println "Receiving data" queued-data)
                      ; Simulate some work/load on data
                      (Thread/sleep 2000)
                      (println "Done with work; ready to queue more up!"))
         ; This is a little warty at the moment, but closing over the queue and agent lets you requeue work on
         ; failure so you can try again.
         :error-handler-builder
         (fn [q a] (println "do something with errors"))))
(defn -main []
  (doseq [i (range 10)]
    (enqueue a (str "data" i))
    (Thread/sleep 500) ; simulate things happening
    ; This part stinks... have to manually let the queued agent know that we've queued some things up for it
    (ping a)))
As you'll notice, having to ping the queued-agent here every time new data is added is pretty warty. It definitely feels like things are being twisted out of typical usage.
Agents are the inverse of what you want here: they are a value that gets sent updating functions. This is easiest with a queue and a thread. For convenience, I am using future to construct the thread.
user> (def q (java.util.concurrent.LinkedBlockingDeque.))
#'user/q
user> (defn accumulate
        [summary input]
        (let [{vowels true consonents false}
              (group-by #(contains? (set "aeiouAEIOU") %) input)]
          (-> summary
              (update-in [:vowels] + (count vowels))
              (update-in [:consonents] + (count consonents)))))
#'user/accumulate
user> (def worker
        (future (loop [summary {:vowels 0 :consonents 0} in-string (.take q)]
                  (if (not in-string)
                    summary
                    (recur (accumulate summary in-string)
                           (.take q))))))
#'user/worker
user> (.add q "hello")
true
user> (.add q "goodbye")
true
user> (.add q false)
true
user> @worker
{:vowels 5, :consonents 7}
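The worker above still handles one message per iteration. To process everything that arrived since the last update in one go, as the question asks, a blocking queue's drainTo method might help: block for the first message, then grab whatever else is already waiting. A sketch, where take-batch! is a made-up name and LinkedBlockingQueue is used in place of the deque purely for illustration:

```clojure
(import '(java.util.concurrent LinkedBlockingQueue)
        '(java.util ArrayList))

(defn take-batch!
  "Blocks until at least one message is available, then returns it
  together with every other message already queued, as a vector."
  [^LinkedBlockingQueue q]
  (let [batch (ArrayList.)]
    (.add batch (.take q))   ; park the thread until there is work
    (.drainTo q batch)       ; then sweep up the rest without blocking
    (vec batch)))

;; Usage: three messages queued before the consumer gets to run.
(let [q (LinkedBlockingQueue.)]
  (doseq [s ["a" "b" "c"]]
    (.add q s))
  (take-batch! q))
;; => ["a" "b" "c"]
```

A worker loop would then call something like (recur (update-fn state (take-batch! q))) instead of handling one string per iteration.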
I came up with something closer to an actor, inspired by Tim Baldridge's cast on actors (Episode 16). I think this addresses the problem much more cleanly.
(require '[clojure.core.async :refer [chan go <! alts! put!]])
(defmacro take-all! [c]
  `(loop [acc# []]
     ;; bind the port to its own gensym rather than shadowing the channel symbol
     (let [[v# p#] (alts! [~c] :default nil)]
       (if (not= p# :default)
         (recur (conj acc# v#))
         acc#))))
(defn eager-actor [f]
  (let [msgbox (chan 1024)]
    (go (loop [f f]
          (let [first-msg (<! msgbox) ; do this so we park efficiently, and only
                                      ; run when there are actually messages
                msgs (take-all! msgbox)
                msgs (concat [first-msg] msgs)]
            (recur (f msgs)))))
    msgbox))
(let [a (eager-actor (fn f [ms]
                       (Thread/sleep 1000) ; simulate work
                       (println "doing something with" ms)
                       f))]
  (doseq [i (range 20)]
    (Thread/sleep 300)
    (put! a i)))
;; =>
;; doing something with (0)
;; doing something with (1 2 3)
;; doing something with (4 5 6)
;; doing something with (7 8 9 10)
;; doing something with (11 12 13)
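Newer core.async versions also ship poll!, which takes a value only if one is immediately available, so the draining step can be a plain function instead of a macro. A sketch under that assumption; drain-now is a made-up name:

```clojure
(require '[clojure.core.async :as async])

(defn drain-now
  "Returns a vector of every value that can be taken from ch right now,
  without blocking or parking. Stops at the first nil from poll!, which
  also covers a closed, empty channel."
  [ch]
  (loop [acc []]
    (if-some [v (async/poll! ch)]
      (recur (conj acc v))
      acc)))

;; Usage: three buffered values are drained in one call.
(let [ch (async/chan 10)]
  (doseq [x [1 2 3]]
    (async/>!! ch x))
  [(drain-now ch) (drain-now ch)])
;; => [[1 2 3] []]
```

Because poll! is an ordinary function it works both inside and outside go blocks, so the actor loop could call (drain-now msgbox) where it currently uses take-all!.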

Process a collection at a timed interval or when it reaches a certain size

I am reading Strings in from the standard input with
(line-seq (java.io.BufferedReader. *in*))
How can I:
Store the lines in a collection
At some interval (say 5 minutes) process the collection and also
Process the collection as soon as its size grows to n (say 10) regardless of timing?
Here are my suggestions:
As you can see at http://clojuredocs.org/clojure_core/clojure.core/line-seq, the result of (line-seq (BufferedReader. xxx)) is a sequence, so this function already stores the lines in (and returns) a new collection.
You can do it with the clojure/core.async timeout function (http://clojure.github.io/core.async/#clojure.core.async/timeout); take a look at https://github.com/clojure/core.async/blob/master/examples/walkthrough.clj to get acquainted with the library.
Just use a conditional (if, when, ...) to check the count of the collection.
As @tangrammer says, core.async would be a good way to go, or Lamina (sample-every).
I pieced something together using a single atom. You probably have to adjust things to your need (e.g. parallel execution, not using a future to create the periodic processing thread, return values, ...). The following code creates processor-with-interval-and-threshold, a function creating another function that can be given a seq of elements which is processed in the way you described.
(defn- periodically!
  [interval f]
  (future
    (while true
      (Thread/sleep interval)
      (f))))
(defn- build-head-and-tail
  [{:keys [head tail]} n elements]
  (let [[a b] (->> (concat tail elements)
                   (split-at n))]
    {:head (concat head a) :tail b}))
(defn- build-ready-elements
  [{:keys [head tail]}]
  {:ready (concat head tail)})
(defn processor-with-interval-and-threshold
  [interval threshold f]
  (let [q (atom {})]
    (letfn [(process-elements! []
              (let [{:keys [ready]} (swap! q build-ready-elements)]
                (when-not (empty? ready)
                  (f ready))))]
      (periodically! interval process-elements!)
      (fn [sq]
        (let [{:keys [head]} (swap! q build-head-and-tail threshold sq)]
          (when (>= (count head) threshold)
            (process-elements!)))))))
The atom q manages a map of three elements:
:head: a seq that gets filled first and checked against the threshold,
:tail: a seq with the elements that exceed the threshold (possibly lazy),
:ready: elements to be processed.
Now, you could for example do the following:
(let [add! (processor-with-interval-and-threshold 300000 10 your-fn)]
  (doseq [x (line-seq (java.io.BufferedReader. *in*))]
    (add! [x])))
That should be enough to get you started, I guess.