I haven't done any Clojure for a couple of years, so I decided to go back and not ignore core.async this time around. Pretty cool stuff, that - but it surprised me almost immediately. Now, I understand that there's inherent indeterminism when multiple threads are involved, but there's something bigger than that at play here.
The source code for my oh-so-simple example, where I am trying to copy lines from STDIN to a file:
(defn append-to-file
  "Write a string to the end of a file"
  ([filename s]
   (spit filename (str s "\n")
         :append true))
  ([s]
   (append-to-file "/tmp/journal.txt" s)))
(defn -main
  "I don't do a whole lot ... yet."
  [& args]
  (println "Initializing..")
  (let [out-chan (a/chan)]
    (loop [line (read-line)]
      (if (empty? line) :ok
          (do
            (go (>! out-chan line))
            (go (append-to-file (<! out-chan)))
            (recur (read-line)))))))
except, of course, this turned out to be not so simple. I think I've narrowed it down to something that's not properly cleaned up. Basically, running the main function produces inconsistent results. Sometimes I run it 4 times and see 12 lines in the output. But sometimes 4 runs will produce just 10 lines. Or, like below, 3 runs, 6 lines:
akamac.home ➜ coras git:(master) ✗ make clean
cat /dev/null > /tmp/journal.txt
lein clean
akamac.home ➜ coras git:(master) ✗ make compile
lein uberjar
Compiling coras.core
Created /Users/akarpov/repos/coras/target/uberjar/coras-0.1.0-SNAPSHOT.jar
Created /Users/akarpov/repos/coras/target/uberjar/coras-0.1.0-SNAPSHOT-standalone.jar
akamac.home ➜ coras git:(master) ✗ make run
java -jar target/uberjar/coras-0.1.0-SNAPSHOT-standalone.jar < resources/input.txt
Initializing..
akamac.home ➜ coras git:(master) ✗ make run
java -jar target/uberjar/coras-0.1.0-SNAPSHOT-standalone.jar < resources/input.txt
Initializing..
akamac.home ➜ coras git:(master) ✗ make run
java -jar target/uberjar/coras-0.1.0-SNAPSHOT-standalone.jar < resources/input.txt
Initializing..
akamac.home ➜ coras git:(master) ✗ make check
cat /tmp/journal.txt
line a
line z
line b
line a
line b
line z
(Basically, sometimes a run produced 3 lines, sometimes 0, sometimes 1 or 2).
The fact that lines appear in random order doesn't bother me - go blocks do things in a concurrent/threaded manner, and all bets are off. But why don't they do all of the work all the time? (Because I am misusing them somehow, but where?)
Thanks!
There are many problems with this code; let me walk through them quickly:
1) Every time you call (go ...) you're spinning off a new "thread" that will be executed in a thread pool. It is undefined when this thread will run.
2) You aren't waiting for the completion of these threads, so it's possible (and very likely) that you will end up reading several lines from the input and writing several lines to the channel before a read even occurs.
3) You are firing off multiple calls to append-to-file at the same time (see #2). These functions are not synchronized, so it's possible that multiple threads will append at once. Since access to files in most OSes is uncoordinated, it's possible for two threads to write to your file at the same time, overwriting each other's results.
4) Since you are creating a new go block for every line read, it's possible they will execute in a different order than you expect, which means the lines in the output file may be out of order.
I think all this can be fixed by avoiding a rather common anti-pattern with core.async: don't create go blocks (or threads) inside unbounded or large loops. Often this does something you don't expect. Instead, create one core.async/thread with a loop that reads from the input (since it's doing IO, never do IO inside a go block) and writes to the channel, and another that reads from the channel and writes to the output file.
View this as an assembly line built out of workers (go blocks) and conveyor belts (channels). If you built a factory, you wouldn't take a pile of people and pair them up saying "you take one item; when you're done, hand it to him". Instead you'd organize all the people once, with conveyors between them, and "flow" the work (or data) between the workers. Your workers should be static, and your data should be moving.
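As a rough sketch of that shape (the name copy-lines is made up; the inline spit mirrors append-to-file from the question): one a/thread feeds the channel, another drains it, and both workers are created exactly once.

```clojure
(require '[clojure.core.async :as a :refer [chan thread >!! <!! close!]])

;; Two static workers connected by one channel: a producer thread that
;; puts each line on the channel, and a consumer thread that appends
;; each line to the file. a/thread (not go) because both sides do IO.
(defn copy-lines [lines filename]
  (let [ch (chan)]
    ;; producer: put every line, then close so the consumer knows to stop
    (thread
      (doseq [line lines]
        (>!! ch line))
      (close! ch))
    ;; consumer: loop until the channel closes; <!! on the thread's
    ;; result channel blocks until the consumer is done
    (<!! (thread
           (loop []
             (when-let [line (<!! ch)]
               (spit filename (str line "\n") :append true)
               (recur)))))))
```

Because the consumer is the only writer, the file appends are serialized, and closing the channel gives a clean shutdown signal.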
.. and of course, this was a misuse of core.async on my part:
If I care about seeing all the data in the output, I must use a blocking take on the channel when I want to pass the value to my I/O code -- and, as was pointed out, that blocking call should not be inside a go block. A single-line change was all I needed:
from:
(go (append-to-file (<! out-chan)))
to:
(append-to-file (<!! out-chan))
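For completeness, a sketch of the whole corrected loop, restructured as a function taking the filename (with a 2-arity append-to-file) purely so it can be exercised outside -main; those adjustments are just for illustration:

```clojure
(require '[clojure.core.async :as a])

(defn append-to-file
  "Write a string to the end of a file"
  [filename s]
  (spit filename (str s "\n") :append true))

(defn copy-stdin-to-file [filename]
  (let [out-chan (a/chan)]
    (loop [line (read-line)]
      (if (empty? line) :ok
          (do
            (a/go (a/>! out-chan line))
            ;; blocking take on the calling thread, outside any go block,
            ;; so every put is matched by a take before the next read
            (append-to-file filename (a/<!! out-chan))
            (recur (read-line)))))))
```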
Related
Consider a core.async channel which is created like so:
(def c (chan))
And let's assume values are put onto and taken from this channel in different places (e.g. in go-loops).
How would one flush all the items on the channel at a certain time?
For instance, one could make the channel an atom and then have an event like this:
(def c (atom (chan)))

(defn reset []
  (close! @c)
  (reset! c (chan)))
Is there another way to do so?
Read everything into a vector with async/into and simply don't use it:
(go (async/into [] c))
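One caveat worth noting: async/into only delivers its vector once the source channel is closed. A small sketch, using a buffered channel so the blocking puts complete immediately:

```clojure
(require '[clojure.core.async :as a :refer [chan >!! close! <!!]])

(def c (chan 10))  ;; buffered, so the blocking puts below don't park
(>!! c 1)
(>!! c 2)
(>!! c 3)
(close! c)         ;; a/into completes only when the source closes

;; a/into returns a channel that yields everything c contained
(def drained (<!! (a/into [] c)))
;; drained => [1 2 3]
```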
Let's define a little more clearly what you seem to want to do: you have code running in several go-loops, each of them putting data on the same channel. You want to be able to tell them all: "the channel you're putting values on is no good anymore; from now on, put your values on some other channel." If that's not what you want to do, then your original question doesn't make much sense, as there's no "flushing" to be done -- you either take the values being put on the channel, or you don't.
First, understand the reason your approach won't work, which the comments to your question touch on: if you deref an atom c, you get a channel, and that value is always the same channel. You have code in go-loops that have called >! and are currently parked, waiting for takers. When you close @c, those parked threads stay parked (anyone parked while taking from a channel (<!) will immediately get the value nil when the channel closes, but parked >!s will simply stay parked). You can reset! c all day long, but the parked threads are still parked on a previous value they got from derefing.
So, how do you do it? Here's one approach.
(require '[clojure.core.async :as a
           :refer [>! <! >!! <!! alt! take! go-loop chan close! mult tap]])

(def rand-int-chan (chan))
(def control-chan (chan))
(def control-chan-mult (mult control-chan))

(defn create-worker
  [put-chan control-chan worker-num]
  (go-loop [put-chan put-chan]
    (alt!
      [[put-chan (rand-int 10)]]
      ([_ _]
       (println (str "Worker" worker-num " generated value."))
       (recur put-chan))

      control-chan
      ([new-chan] (recur new-chan)))))

(defn create-workers
  [n c cc]
  (dotimes [n n]
    (let [tap-chan (chan)]
      (a/tap cc tap-chan)
      (create-worker c tap-chan n))))

(create-workers 5 rand-int-chan control-chan-mult)
So we are going to create 5 worker loops that will put their result on rand-int-chan, and we will give them a "control channel." I will let you explore mult and tap on your own, but in short, we are creating a single channel which we can put values on, and that value is then broadcast to all channels which tap it.
In our worker loop, we do one of two things: put a value onto the rand-int-chan that we use when we create it, or we will take a value off of this control channel. We can cleverly let the worker thread know that the channel to put its values on has changed by actually handing it the new channel, which it will then bind on the next time through the loop. So, to see it in action:
(<!! rand-int-chan)
=> 6
Worker2 generated value.
This will take random ints from the channel, and the worker thread will print that it has generated a value, to see that indeed multiple threads are participating here.
Now, let's say we want to change the channel to put the random integers on. No problem, we do:
(def new-rand-int-chan (chan))
(>!! control-chan new-rand-int-chan)
(close! rand-int-chan) ;; for good measure, may not be necessary
We create the channel, and then we put that channel onto our control-chan. When we do this, every worker thread will have the second portion of its alt! executed, which simply loops back to the top of the go-loop, except this time the put-chan will be bound to the new-rand-int-chan we just received. So now:
(<!! new-rand-int-chan)
=> 3
Worker1 generated value.
This gives us our integers, which is exactly what we want. Any attempt to <!! from the old channel will give nil, since we closed the channel:
(<!! rand-int-chan)
; nil
I have some core.async code with a pipeline of two chans and three nodes:
a producer - function that puts values into chan1 with >!! (it's not in a go-block but the function is called from inside a go-loop)
a filter - another function that's not in a go-block but is called within a go-loop, which pulls items from chan1 (with <!!), does a test and if the test passes pushes them onto chan2 (with >!!)
a consumer - an ordinary loop that pulls n values off chan2 with <!!
This code works as expected when I run it as a simple program. But when I copy and paste it to work within a unit-test, it freezes up.
My test code is roughly
(deftest a-test
  (testing "blah"
    (is (= (let [c1 (chan)
                 c2 (chan)
                 gen (make-generator c1)
                 filt (make-filter c1 c2)
                 result (collector c2 10)]
             result)
           [0 2 4 6 8 10 12 14 16 18 20]))))
where the generator creates a sequence of integers counting up from zero and the filter tests for evenness.
As far as I can tell, the filter is able to pull the first value from c1, but is blocked waiting for a second value. Meanwhile, the generator is blocking while waiting to push its next value into c1.
But this doesn't happen when I run the code in a simple stand-alone program.
So, is there any reason that the unit-test framework might be interfering or causing problems with the threading management that core.async is providing? Is it possible to do unit-testing on async code like this?
I'm concerned that I'm not running the collector in any kind of go-block or go-loop so presumably it might be blocking the main thread. But equally, I presume I have to pull all the data back into the main thread eventually. And if not through that mechanism, how?
While using blocking IO within go-blocks/go-loops isn't the best solution, the thread macro may be a better fit here. It executes the passed body on a separate thread, so you may freely use blocking operations there.
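A sketch of the three nodes rebuilt on the thread macro (the bodies of make-generator, make-filter and collector here are assumptions, matched to the names used in the test above):

```clojure
(require '[clojure.core.async :as a :refer [chan thread >!! <!! close!]])

(defn make-generator [c1]
  ;; a real thread, so the blocking >!! is safe
  (thread
    (doseq [i (range 100)]
      (>!! c1 i))
    (close! c1)))

(defn make-filter [c1 c2]
  (thread
    (loop []
      (when-let [v (<!! c1)]
        (when (even? v)
          (>!! c2 v))
        (recur)))))

(defn collector [c2 n]
  ;; blocking takes on the caller's own thread are fine here:
  ;; the other two nodes run on core.async's thread pool
  (vec (repeatedly n #(<!! c2))))
```

Because generator and filter each own a dedicated thread, the collector can block on the main thread without starving them, which is what was deadlocking the unit test.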
I have two unidirectional core.async channels:
channel out can only put!
channel in can only take!
And since this is ClojureScript the blocking operations are not available. I would like to make one bidirectional (in-out) channel out of those two (in and out).
(def in (async/chan))
(def out (async/chan))
(def in-out (io-chan in out)) ;; io-chan or whatever the solution is

(async/put! in "test")
(async/take! in-out (fn [x] (println x))) ;; should print "test"
(async/put! in-out "value") ;; a put into in-out is equivalent to putting into `out`
I tried something like the following (not working):
(defn io-chan [in-ch out-ch]
  (let [io (chan)]
    (go-loop []
      (>! out-ch (<! io))
      (>! io (<! in-ch))
      (recur))
    io))
A schema might help:
       out                 in-out
  --------------->        (unused)
  <---------------   <---------------

       in
  ---------------->  ---------------->
  <----------------       (unused)
Also, closing the bidirectional channel should close both underlying channels.
Is it possible?
If I understand your use case right, I believe what you're trying to do is just a one-channel job.
On the other hand, if what you're trying to do is to present a channel-like interface for a composite of several channels (e.g some process takes data from in, processes it, and outputs the result to out), then you could always implement the right protocols (in the case of ClojureScript, cljs.core.async.impl.protocols/ReadPort and cljs.core.async.impl.protocols/WritePort).
I would personally not recommend it. Leaving aside the fact that you'd be relying on implementation details, I don't believe core.async channels are intended as encapsulation for processes, only as communication points between them. So in this use case, just pass the input channel to producers and the output channel to consumers.
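To illustrate that shape, a sketch of such a process (the name process and the string transform are invented): it takes from in, transforms, puts to out, and propagates closing.

```clojure
(require '[clojure.core.async :as a :refer [chan go-loop <! >! close! <!!]])

;; The process owns neither channel; it just connects them.
;; Producers get `in`, consumers get `out`.
(defn process [in out]
  (go-loop []
    (if-let [v (<! in)]
      (do (>! out (str "processed: " v))
          (recur))
      ;; `in` closed: close `out` so downstream consumers can finish
      (close! out))))
```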
Your example shows a flow basically like this:
io ---> out-ch ---> worker ---> in-ch ---> io
^-------------------------------------------*
If we assume that worker reads from in-ch and writes to out-ch, then perhaps these two channels are reversed in the example. If worker does the opposite, then it's correct. In order to prevent loops, it's important that you use non-buffered channels so you don't hear your own messages echoed back to yourself.
As a side note, there is no such thing as unidirectional and bi-directional channels; instead there are buffered and unbuffered channels. If we are talking over an unbuffered channel, then when I have something to say to you, I park until you happen to be listening to the channel; once you are ready to hear it, I put my message into the channel and you receive it. Then to get a response I park until you are ready to send it, and once you are, you put it on the channel and I get it from the channel (all at once). This feels like a bi-directional channel, though it's really just that unbuffered channels happen to coordinate this way.
If the channel is buffered, then I might get my own message back from the channel, because I would finish putting it on the channel and then be ready to receive the response before you were even ready to receive the original message. If you need to use buffered channels like this, then use two of them, one for each direction, and they will "feel" like uni-directional channels.
I'm looking at Clojure core.async for the first time, and was going through this excellent presentation by Rich Hickey: http://www.infoq.com/presentations/clojure-core-async
I had a question about the example he shows at the end of his presentation:
According to Rich, this example basically tries to get a web, video, and image result for a specific query. It tries two different sources in parallel for each of those results, and just pulls out the fastest result for each. And the entire operation can take no more than 80ms, so if we can't get e.g. an image result in 80ms, we'll just give up. The 'fastest' function creates and returns a new channel, and starts two go processes racing to retrieve a result and put it on the channel. Then we just take the first result off of the 'fastest' channel and slap it onto the c channel.
My question: what happens to these three temporary, unnamed 'fastest' channels after we take their first result? Presumably there is still a go process which is parked trying to put the second result onto the channel, but no one is listening so it never actually completes. And since the channel is never bound to anything, it doesn't seem like we have any way of doing anything with it ever again. Will the go process & channel "realize" that no one cares about their results any more and clean themselves up? Or did we essentially just "leak" three channels / go processes in this code?
There is no leak.
Parked gos are attached to channels on which they attempted to perform an operation and have no independent existence beyond that. If other code loses interest in the channels a certain go is parked on (NB. a go can simultaneously become a putter/taker on many channels if it parks on alt! / alts!), then eventually it'll be GC'd along with those channels.
The only caveat is that in order to be GC'd, gos actually have to park first. So any go that keeps doing stuff in a loop without ever parking (<! / >! / alt! / alts!) will in fact live forever. It's hard to write this sort of code by accident, though.
Caveats and exceptions aside, you can test garbage collection on the JVM at the REPL.
eg:
(require '[clojure.core.async :as async])
=> nil
(def c (async/chan))
=> #'user/c
(def d (async/go-loop []
(when-let [v (async/<! c)]
(println v)
(recur))))
=> #'user/d
(async/>!! c :hi)
=> true
:hi ; core.async go block is working
(import java.lang.ref.WeakReference)
=> java.lang.ref.WeakReference ; hold a reference without preventing garbage collection
(def e (WeakReference. c))
=> #'user/e
(def f (WeakReference. d))
=> #'user/f
(.get e)
=> #object[...]
(.get f)
=> #object[...]
(def c nil)
=> #'user/c
(def d nil)
=> #'user/d
(println "We need to clear *1, *2 and *3 in the REPL.")
We need to clear *1, *2 and *3 in the REPL.
=> nil
(println *1 *2 *3)
nil #'user/d #'user/c
=> nil
(System/gc)
=> nil
(.get e)
=> nil
(.get f)
=> nil
What just happened? I set up a go block and checked it was working. Then I used a WeakReference to observe the communication channel (c) and the go block return channel (d). Then I removed all references to c and d (including *1, *2 and *3 created by my REPL), requested garbage collection (and got lucky; the System.gc Javadoc does not make strong guarantees), and then observed that my weak references had been cleared.
In this case at least, once references to the channels involved had been removed, the channels were garbage collected (regardless of my failure to close them!)
Presumably a channel produced by fastest only returns the result of the fastest query method and then closes.
If a second result was produced, your assumption would hold that the fastest processes are leaked. Their results are never consumed. If they relied on all their results being consumed in order to terminate, they wouldn't terminate.
Notice that this could also happen if the channel t is selected in the alt! clause.
The usual way to fix this would be to close the channel c in the last go block with close!. Puts made to a closed channel will then be dropped, and the producers can terminate.
The problem could also be solved in the implementation of fastest. The process created in fastest could itself make the put via alts! and timeout and terminate if the produced values are not consumed within a certain amount of time.
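A sketch of that alternative (the helper name put-with-timeout is invented): the producing go block races its own put against a timeout via alts! and gives up if nobody consumes the value in time.

```clojure
(require '[clojure.core.async :as a :refer [chan go alts! timeout <!!]])

(defn put-with-timeout
  "Try to put v on c; give up if it isn't consumed within ms milliseconds.
  Returns a channel yielding :delivered or :timed-out."
  [c v ms]
  (go
    ;; alts! parks on both the put and the timeout; whichever is
    ;; ready first wins, so an unconsumed put can't park forever
    (let [[_ ch] (alts! [[c v] (timeout ms)])]
      (if (= ch c) :delivered :timed-out))))
```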
I guess Rich did not address the problem in the slide in favor of a less lengthy example.
I've written some simple Clojure code that accesses the Twitter streaming API. My code is essentially the same as the example code described in the twitter-api docs:
(def ^:dynamic *custom-streaming-callback*
  (AsyncStreamingCallback. (comp println #(:text %) json/read-json #(str %2))
                           (comp println response-return-everything)
                           exception-print))

(defn start-filtering []
  (statuses-filter :params {:follow 12345}
                   :oauth-creds *creds*
                   :callbacks *custom-streaming-callback*))
I'm following tweets about a specific user and using oauth for authentication (not shown). When I run the start-filtering method and a connection is opened with twitter everything works well for a spell, but if the stream is inactive for a bit (around 30 seconds), i.e. no tweets about this particular user are coming down the pike, the following error occurs:
#<EOFException java.io.EOFException: JSON error (end-of-file)>
I assumed from the twitter docs that when using a streaming connection, twitter keeps the stream open indefinitely. I must be making some incorrect assumptions. I'm currently diving into the clojure twitter-api code to see what's going on, but I thought more eyes would help me figure this out more quickly.
I had the same issue that you have. As you found, the streaming function emits an empty message if no data has been received in the last thirty seconds or so.
Trying to read this as json then causes the EOF exception that you see.
I don't know of any way to prevent these calls. In my case I worked around the issue with a simple conditional that falls back to an empty map when there is no JSON to read.
(if-not (clojure.string/blank? %)
  (json/read-str % :key-fn keyword)
  {})