Iterator blocks in Clojure? - clojure

I am using clojure.contrib.sql to fetch some records from an SQLite database.
(defn read-all-foo []
  (with-connection *db*
    (with-query-results res ["select * from foo"]
      (into [] res))))
Now, I don't really want to realize the whole sequence before returning from the function (i.e. I want to keep it lazy), but if I return res directly, or wrap it in some kind of lazy wrapper (for example, to apply a map transformation to the result sequence), the SQL-related bindings will be reset and the connection will be closed after I return, so realizing the sequence will throw an exception.
How can I enclose the whole function in a closure and return a kind of iterator block (like yield in C# or Python)?
Or is there another way to return a lazy sequence from this function?

The resultset-seq that with-query-results returns is probably already as lazy as you're going to get. Laziness only works as long as the handle is open, as you said. There's no way around this. You can't read from a database if the database handle is closed.
If you need to do I/O and keep the data after the handle is closed, then open the handle, slurp it in fast (defeating laziness), close the handle, and work with the results afterward. If you want to iterate over some data without keeping it all in memory at once, then open the handle, get a lazy seq on the data, doseq over it, then close the handle.
So if you want to do something with each row (for side-effects) and discard the results without eating the whole resultset into memory, then you could do this:
(defn do-something-with-all-foo [f]
  (let [sql "select * from foo"]
    (with-connection *db*
      (with-query-results res [sql]
        (doseq [row res]
          (f row))))))
user> (do-something-with-all-foo println)
{:id 1}
{:id 2}
{:id 3}
nil
;; transforming the data as you go
user> (do-something-with-all-foo #(println (assoc % :bar :baz)))
{:id 1, :bar :baz}
{:id 2, :bar :baz}
{:id 3, :bar :baz}
If you want your data to hang around long-term, then you may as well slurp it all in using your read-all-foo function above (thus defeating laziness). If you want to transform the data, then map over the results after you've fetched it all. Your data will all be in memory at that point, but the map call itself and your post-fetch data transformations will be lazy.
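For example, a minimal sketch of the slurp-then-transform approach (reusing the read-all-foo shape from above; the :bar key is just an illustration):
(defn read-and-transform-foo []
  ;; rows are fully realized while the connection is still open
  (let [rows (with-connection *db*
               (with-query-results res ["select * from foo"]
                 (into [] res)))]
    ;; map is lazy, but the data is already safely in memory
    (map #(assoc % :bar :baz) rows)))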

It is in fact possible to add a "terminating side-effect" to a lazy sequence, to be executed once, when the entire sequence is consumed for the first time:
(def s (lazy-cat (range 10) (do (println :foo) nil)))
(first s)
; => returns 0, prints out nothing
(doall (take 10 s))
; => returns (0 1 2 3 4 5 6 7 8 9), prints nothing
(last s)
; => returns 9, prints :foo
(doall s)
; => returns (0 1 2 3 4 5 6 7 8 9), prints :foo
; or rather, prints :foo if it's the first time s has been
; consumed in full; you'll have to redefine it if you called
; (last s) earlier
I'm not sure I'd use this to close a DB connection, though. It's considered best practice not to hold on to a DB connection indefinitely, and putting your connection-closing call at the end of your lazy sequence of results would not only hold on to the connection longer than strictly necessary, but would also open up the possibility that your programme fails for an unrelated reason without ever closing the connection. Thus for this scenario, I would normally just slurp in all the data. As Brian says, you can store it all somewhere unprocessed, then perform any transformations lazily, so you should be fine as long as you're not trying to pull in a really huge dataset in one chunk.
But then I don't know your exact circumstances, so if it makes sense from your point of view, you can definitely call a connection-closing function at the tail end of your result sequence. As Michiel Borkent points out, you wouldn't be able to use with-connection if you wanted to do this.
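As a rough sketch of what that could look like without with-connection (open-connection and run-query are hypothetical placeholders for however you obtain the connection and the lazy resultset yourself):
(defn lazily-read-all-foo []
  (let [conn (open-connection *db*)                   ; hypothetical helper
        rows (run-query conn "select * from foo")]    ; hypothetical helper
    (lazy-cat rows
              ;; runs once, when the sequence is consumed in full
              (do (.close conn) nil))))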

I have never used SQLite with Clojure before, but my guess is that with-connection closes the connection when its body has been evaluated. So you need to manage the connection yourself if you want to keep it open, and close it when you have finished reading the elements you're interested in.

There is no way to create a function or macro "on top" of with-connection and with-query-results to add laziness. Both close their Connection and ResultSet, respectively, when control flow leaves the lexical scope.
As Michal said, it would be no problem to create a lazy seq that closes its ResultSet and Connection lazily. As he also said, it wouldn't be a good idea unless you can guarantee that the sequences are eventually consumed in full.
A feasible solution might be:
(def ^:dynamic *deferred-resultsets*)

(defmacro with-deferred-close [& body]
  `(binding [*deferred-resultsets* (atom #{})]
     (let [ret# (do ~@body)]
       ;;; close the collected resultsets here
       ret#)))

(defmacro with-deferred-results [bind-form sql & body]
  `(let [resultset# (execute-query ...)]
     (swap! *deferred-resultsets* conj resultset#)
     ;;; execute body, similar to with-query-results,
     ;;; but leave the resultset open
     ))
This would allow for e.g. keeping the resultsets open until the current request is finished.
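Hypothetical usage of the sketched macros, assuming they were fleshed out as above:
(with-deferred-close
  (with-deferred-results res ["select * from foo"]
    ;; the lazy seq stays usable until with-deferred-close exits
    (map :id res)))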

Related

How can I record time for function call in clojure

I am a newbie to Clojure. I am invoking a Clojure function from Java, and I want to record the time a particular line of Clojure code takes to execute.
Suppose my Clojure function is:
(defn sampleFunction [sampleInput]
  (fun1 (fun2 sampleInput)))
I invoke the above function from Java; it returns some String value, and I want to record the time it takes to execute fun2.
I have another function, say logTime, which will write the parameter passed to it into some database:
(defn logTime [time]
.....
)
My question is: how can I modify my sampleFunction(..) to invoke logTime to record the time it took to execute fun2?
Thank you in advance.
I'm not entirely sure how the different pieces of your code fit together and interoperate with Java, but here's something that could work with the way you described it.
To get the execution time of a piece of code, there's a core function called time. However, this function doesn't return the execution time, it just prints it... So given that you want to log that time into a database, we need to write a macro to capture both the return value of fun2 as well as the time it took to execute:
(defmacro time-execution
  [& body]
  `(let [s# (new java.io.StringWriter)]
     (binding [*out* s#]
       (hash-map :return (time ~@body)
                 :time (.replaceAll (str s#) "[^0-9\\.]" "")))))
What this macro does is bind standard output to a Java StringWriter, so that we can use it to store whatever the time function prints. To return both the result of fun2 and the time it took to execute, we package the two values in a hash-map (it could be some other collection too - we'll end up destructuring it later). Notice that the code whose execution we're timing is wrapped in a call to time, so that we trigger the printing side effect and capture it in s#. Finally, the .replaceAll is just to ensure that we're only extracting the actual numeric value (in milliseconds), since time prints something of the form "Elapsed time: 0.014617 msecs".
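Before wiring it into your code, a quick REPL sanity check of the macro (the exact number is illustrative and will vary):
user> (time-execution (Thread/sleep 100))
;; => {:return nil, :time "100.411198"}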
Incorporating this into your code, we need to rewrite sampleFunction like so:
(defn sampleFunction [sampleInput]
  (let [{:keys [return time]} (time-execution (fun2 sampleInput))]
    (logTime time)
    (fun1 return)))
We're simply destructuring the hash-map to access both the return value of fun2 and the time it took to execute, then we log the execution time using logTime, and finally we finish by calling fun1 on the return value of fun2.
The library tupelo.prof gives you many options if you want to capture execution time for one or more functions and accumulate it over multiple calls. An example:
(ns tst.demo.core
  (:use tupelo.core tupelo.test)
  (:require
    [tupelo.profile :as prof]))

(defn add2 [x y] (+ x y))

(prof/defnp fast [] (reduce add2 0 (range 10000)))
(prof/defnp slow [] (reduce add2 0 (range 10000000)))

(dotest
  (prof/timer-stats-reset)
  (dotimes [i 10000] (fast))
  (dotimes [i 10] (slow))
  (prof/print-profile-stats))
with result:
--------------------------------------
Clojure 1.10.2-alpha1 Java 14
--------------------------------------
Testing tst.demo.core
---------------------------------------------------------------------------------------------------
Profile Stats:
  Samples       TOTAL        MEAN       SIGMA   ID
    10000       0.955    0.000096    0.000045   :tst.demo.core/fast
       10       0.905    0.090500    0.000965   :tst.demo.core/slow
---------------------------------------------------------------------------------------------------
If you want detailed timing for a single method, the Criterium library is what you need. Start off with the quick-bench function.
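A minimal sketch of getting started with Criterium (assuming the dependency is on your classpath; reusing the add2 reduction from above):
(require '[criterium.core :refer [quick-bench]])

;; prints detailed statistics (mean, std-deviation, percentiles) to *out*
(quick-bench (reduce add2 0 (range 10000)))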
Since the accepted answer has some shortcomings around eating up logs etc., here is perhaps a simpler solution compared to the accepted answer:
(defmacro time-execution [body]
  `(let [st# (System/currentTimeMillis)
         return# ~body
         se# (System/currentTimeMillis)]
     {:return return#
      :time (double (/ (- se# st#) 1000))}))
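Illustrative usage (the :time value is in seconds and will vary):
user> (time-execution (Thread/sleep 250))
;; => {:return nil, :time 0.25}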

Closing a channel at the producer end when all the jobs are finished

For my Mandelbrot explorer project, I need to run several expensive jobs, ideally in parallel. I decided to try chunking the jobs and running each chunk in its own thread, and ended up with something like
(defn point-calculator [chunk-size points]
  (let [out-chan (chan (count points))
        chunked (partition chunk-size points)]
    (doseq [chunk chunked]
      (thread
        (let [processed-chunk (expensive-calculation chunk)]
          (>!! out-chan processed-chunk))))
    out-chan))
Where points is a list of [real, imaginary] coordinates to be tested, and expensive-calculation is a function that takes the chunk, and tests each point in the chunk. Each chunk can take a long time to finish (potentially a minute or more depending on the chunk size and the number of jobs).
On my consumer end, I'm using
(loop []
  (when-let [proc-chunk (<!! result-chan)]
    ; Do stuff with chunk
    (recur)))
To consume each processed chunk. Right now, this blocks when the last chunk is consumed since the channel is still open.
I need a way of closing the channel when the jobs are done. This is proving difficult because of asynchronicity of the producer loop. I can't simply put a close! after the doseq since the loop doesn't block, and I can't just close when the last-indexed job is done, since the order is indeterminate.
The best idea I could come up with was maintaining an (atom #{}) of jobs and disj-ing each job as it finishes. Then I could either check the set's size in the loop and close! when it's 0, or attach a watch to the atom and check there.
This seems very hackish though. Is there a more idiomatic way of dealing with this? Does this scenario suggest I'm using async incorrectly?
I would take a look at the take function from core.async. This is what its documentation says:
"Returns a channel that will return, at most, n items from ch. After n items
have been returned, or ch has been closed, the return channel will close."
So it leads you to a simple fix: instead of returning out-chan, you can just wrap it in take:
(clojure.core.async/take (count chunked) out-chan)
That should work.
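A sketch of the producer with that change applied (same shape as your point-calculator, with the requires spelled out; expensive-calculation is assumed from your question):
(require '[clojure.core.async :as async :refer [chan thread >!!]])

(defn point-calculator [chunk-size points]
  (let [chunked  (partition chunk-size points)
        out-chan (chan (count points))]
    (doseq [chunk chunked]
      (thread
        (>!! out-chan (expensive-calculation chunk))))
    ;; closes after exactly one result per chunk has been delivered
    (async/take (count chunked) out-chan)))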
Also, I would recommend rewriting your example from blocking puts/takes (>!!, <!!) to parking ones (>!, <!), and from thread to go / go-loop, which is more idiomatic usage of core.async.
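For the consumer side, a parking variant could look like this (a sketch, with result-chan as in your consumer snippet and the processing itself elided):
(require '[clojure.core.async :refer [go-loop <!]])

(go-loop []
  (when-let [proc-chunk (<! result-chan)]
    ;; do stuff with chunk
    (recur)))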
You may want to use async/pipeline(-blocking) to control parallelism, and async/onto-chan to close the input channel automatically after all the chunks are copied.
E.g. the example below shows a 16x improvement in elapsed time when parallelism is set to 16.
(require '[clojure.core.async :refer [chan go-loop <! <!! pipeline-blocking onto-chan]])

(defn expensive-calculation [pts]
  (Thread/sleep 100)
  (reduce + pts))

(time
  (let [points (take 10000 (repeatedly #(rand 100)))
        chunk-size 500
        inp-chan (chan)
        out-chan (chan)]
    (go-loop [] (when-let [res (<! out-chan)]
                  ;; do stuff with chunk
                  (recur)))
    (pipeline-blocking 16 out-chan (map expensive-calculation) inp-chan)
    (<!! (onto-chan inp-chan (partition-all chunk-size points)))))

flushing the content of a core.async channel

Consider a core.async channel which is created like so:
(def c (chan))
And let's assume values are put and taken to this channel from different places (eg. in go-loops).
How would one flush all the items on the channel at a certain time?
For instance one could make the channel an atom and then have an event like this:
(def c (atom (chan)))

(defn reset []
  (close! @c)
  (reset! c (chan)))
Is there another way to do so?
Read everything into a vector with async/into and don't use it.
(go (async/into [] c))
Let's define a little more clearly what you seem to want to do: you have code running in several go-loops, each of them putting data on the same channel. You want to be able to tell them all: "the channel you're putting values on is no good anymore; from now on, put your values on some other channel." If that's not what you want to do, then your original question doesn't make much sense, as there's no "flushing" to be done -- you either take the values being put on the channel, or you don't.
First, understand the reason your approach won't work, which the comments to your question touch on: if you deref an atom c, you get a channel, and that value is always the same channel. You have code in go-loops that have called >! and are currently parked, waiting for takers. When you close @c, those parked threads stay parked (anyone parked while taking from a channel with <! will immediately get the value nil when the channel closes, but parked >!s will simply stay parked). You can reset! c all day long, but the parked threads are still parked on a previous value they got from derefing.
So, how do you do it? Here's one approach.
(require '[clojure.core.async :as a
           :refer [>! <! >!! <!! alt! take! go-loop chan close! mult tap]])

(def rand-int-chan (chan))
(def control-chan (chan))
(def control-chan-mult (mult control-chan))

(defn create-worker
  [put-chan control-chan worker-num]
  (go-loop [put-chan put-chan]
    (alt!
      [[put-chan (rand-int 10)]]
      ([_ _] (println (str "Worker" worker-num " generated value."))
             (recur put-chan))

      control-chan
      ([new-chan] (recur new-chan)))))

(defn create-workers
  [n c cc]
  (dotimes [n n]
    (let [tap-chan (chan)]
      (a/tap cc tap-chan)
      (create-worker c tap-chan n))))

(create-workers 5 rand-int-chan control-chan-mult)
(create-workers 5 rand-int-chan control-chan-mult)
So we are going to create 5 worker loops that will put their result on rand-int-chan, and we will give them a "control channel." I will let you explore mult and tap on your own, but in short, we are creating a single channel which we can put values on, and that value is then broadcast to all channels which tap it.
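If it helps, here is a tiny standalone illustration of mult and tap, independent of the worker code (using the same refers as above; the values shown are illustrative):
(def src   (chan))
(def src-m (mult src))
(def tap-a (chan))
(def tap-b (chan))
(tap src-m tap-a)
(tap src-m tap-b)
(>!! src :hello)
(<!! tap-a) ;; => :hello  (each tap receives its own copy)
(<!! tap-b) ;; => :hello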
In our worker loop, we do one of two things: put a value onto the put-chan we were handed when the worker was created, or take a value off of the control channel. We can cleverly let the worker thread know that the channel to put its values on has changed by actually handing it the new channel, which it will then bind on the next time through the loop. So, to see it in action:
(<!! rand-int-chan)
=> 6
Worker2 generated value.
This will take random ints from the channel, and the worker thread will print that it has generated a value, to see that indeed multiple threads are participating here.
Now, let's say we want to change the channel to put the random integers on. No problem, we do:
(def new-rand-int-chan (chan))
(>!! control-chan new-rand-int-chan)
(close! rand-int-chan) ;; for good measure, may not be necessary
We create the channel, and then we put that channel onto our control-chan. When we do this, every worker thread will have the second clause of its alt! executed, which simply loops back to the top of the go-loop, except this time the put-chan will be bound to the new-rand-int-chan we just received. So now:
(<!! new-rand-int-chan)
=> 3
Worker1 generated value.
This gives us our integers, which is exactly what we want. Any attempt to <!! from the old channel will give nil, since we closed the channel:
(<!! rand-int-chan)
; nil

How do clojure core.async channels get cleaned up?

I'm looking at Clojure core.async for the first time, and was going through this excellent presentation by Rich Hickey: http://www.infoq.com/presentations/clojure-core-async
I had a question about the example he shows at the end of his presentation:
According to Rich, this example basically tries to get a web, video, and image result for a specific query. It tries two different sources in parallel for each of those results, and just pulls out the fastest result for each. And the entire operation can take no more than 80ms, so if we can't get e.g. an image result in 80ms, we'll just give up. The 'fastest' function creates and returns a new channel, and starts two go processes racing to retrieve a result and put it on the channel. Then we just take the first result off of the 'fastest' channel and slap it onto the c channel.
My question: what happens to these three temporary, unnamed 'fastest' channels after we take their first result? Presumably there is still a go process which is parked trying to put the second result onto the channel, but no one is listening so it never actually completes. And since the channel is never bound to anything, it doesn't seem like we have any way of doing anything with it ever again. Will the go process & channel "realize" that no one cares about their results any more and clean themselves up? Or did we essentially just "leak" three channels / go processes in this code?
There is no leak.
Parked gos are attached to channels on which they attempted to perform an operation and have no independent existence beyond that. If other code loses interest in the channels a certain go is parked on (NB. a go can simultaneously become a putter/taker on many channels if it parks on alt! / alts!), then eventually it'll be GC'd along with those channels.
The only caveat is that in order to be GC'd, gos actually have to park first. So any go that keeps doing stuff in a loop without ever parking (<! / >! / alt! / alts!) will in fact live forever. It's hard to write this sort of code by accident, though.
Caveats and exceptions aside, you can test garbage collection on the JVM at the REPL.
eg:
(require '[clojure.core.async :as async])
=> nil
(def c (async/chan))
=> #'user/c
(def d (async/go-loop []
         (when-let [v (async/<! c)]
           (println v)
           (recur))))
=> #'user/d
(async/>!! c :hi)
=> true
:hi ; core.async go block is working
(import java.lang.ref.WeakReference)
=> java.lang.ref.WeakReference ; hold a reference without preventing garbage collection
(def e (WeakReference. c))
=> #'user/e
(def f (WeakReference. d))
=> #'user/f
(.get e)
=> #object[...]
(.get f)
=> #object[...]
(def c nil)
=> #'user/c
(def d nil)
=> #'user/d
(println "We need to clear *1, *2 and *3 in the REPL.")
We need to clear *1, *2 and *3 in the REPL.
=> nil
(println *1 *2 *3)
nil #'user/d #'user/c
=> nil
(System/gc)
=> nil
(.get e)
=> nil
(.get f)
=> nil
What just happened? I set up a go block and checked it was working. Then I used a WeakReference to observe the communication channel (c) and the go block's return channel (d). Then I removed all references to c and d (including *1, *2 and *3 created by my REPL), requested garbage collection (and got lucky; the System.gc Javadoc does not make strong guarantees), and then observed that my weak references had been cleared.
In this case at least, once references to the channels involved had been removed, the channels were garbage collected (regardless of my failure to close them!)
Presumably a channel produced by fastest only returns the result of the fastest query method and then closes.
If a second result were produced, your assumption could hold that the fastest processes are leaked. Their results are never consumed. If they relied on all their results being consumed in order to terminate, they wouldn't terminate.
Notice that this could also happen if the channel t is selected in the alt! clause.
The usual way to fix this would be to close the channel c in the last go block with close!. Puts made to a closed channel are then dropped and the producers can terminate.
The problem could also be solved in the implementation of fastest. The process created in fastest could itself make the put via alts! and timeout and terminate if the produced values are not consumed within a certain amount of time.
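A hedged sketch of that second idea; the query functions passed in as replicas are assumed to exist, and 80 ms is just an illustrative deadline:
(require '[clojure.core.async :refer [chan go alts! timeout]])

(defn fastest [query & replicas]
  (let [c (chan)]
    (doseq [replica replicas]
      (go
        (let [result (replica query)]
          ;; either the result is taken, or we give up after 80 ms,
          ;; so the go block can always terminate and be collected
          (alts! [[c result] (timeout 80)]))))
    c))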
I guess Rich did not address the problem in the slide in favor of a less lengthy example.

Infinite lazy-sequence of events from external feed

Say I have a function, (get-events "feed"), that returns a vector of events in chronological order, taken from an external source.
Now, at any given moment, that function returns a list of events up to that point in time. Called a few seconds later, it will return a few more events, etc, as the feed continually grows.
If I want to create a lazy-seq that forever pulls new events from the feed, making sure it doesn't repeat those that have already been seen, how would I write this? I'm running into a stack overflow error when I don't use recur, but I can't use recur, because it doesn't appear in a tail position.
(defn continually-list-events
  ([feed] (continually-list-events feed (hash-set)))
  ([feed seen]
   (let [events-now (get-events feed)]
     (into (remove seen events-now)
           (lazy-seq
             (continually-list-events feed
                                      (into seen events-now)))))))
You can see I'm trying to use an accumulator to track events already seen (in a set), and I'm making sure to always filter out the ones I've seen.
If each step keeps track of how many events have been received so far then that iteration can return a sequence of new events by dropping the old ones.
user> (->> (iterate (fn [[events-so-far contents]]
                      (let [events (get-events)
                            new-events (drop events-so-far events)]
                        [(count events) new-events]))
                    [0 []])   ; initial state: nothing seen yet
           (mapcat second))
Then you can drop the counts from the sequence and flatten the chunks of events into a sequence of single events.
In your example, the stack overflow is because there is no call to cons inside the lazy-seq, so it has to calculate the whole list just to produce the first item in the sequence.
user> (defn example [x] (lazy-seq (cons x (example (inc x)))))
#'user/example
user> (take 5 (example 4))
(4 5 6 7 8)
user> (defn example [x] (lazy-seq (example (inc x))))
#'user/example
user> (take 5 (example 4))
... long pause then out of memory ...
PS: using lazy-seq directly is somewhat uncommon, though it's important to know how it works.
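Applying that to the original function, a hedged sketch (get-events is assumed to behave as described in the question, and note that this polls the feed as fast as the consumer forces new elements):
(defn continually-list-events
  ([feed] (continually-list-events feed #{}))
  ([feed seen]
   (lazy-seq
     (let [events-now (get-events feed)
           new-events (remove seen events-now)]
       (concat new-events
               (continually-list-events feed (into seen events-now)))))))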