I'm experimenting with filtering through elements in parallel. For each element, I need to perform a distance calculation to see if it is close enough to a target point. Never mind that data structures already exist for doing this, I'm just doing initial experiments for now.
Anyway, I wanted to run some very basic experiments where I generate random vectors and filter them. Here's my implementation that does all of this:
(defn pfilter [pred coll]
  ;; Evaluate pred over coll in parallel, then keep the items whose
  ;; predicate result was truthy.
  (map second
       (filter first
               (pmap (fn [item] [(pred item) item]) coll))))
(defn random-n-vector [n]
  (take n (repeatedly rand)))
(defn distance [u v]
  ;; Euclidean distance between u and v.
  (Math/sqrt (reduce + (map #(Math/pow (- %1 %2) 2) u v))))
(defn -main [& args]
  (let [[n-str vectors-str threshold-str] args
        n (Integer/parseInt n-str)
        vectors (Integer/parseInt vectors-str)
        threshold (Double/parseDouble threshold-str)
        random-vector (partial random-n-vector n)
        u (random-vector)]
    (time (println n vectors
                   (count
                    (pfilter
                     (fn [v] (< (distance u v) threshold))
                     (take vectors (repeatedly random-vector))))))))
The code executes and returns what I expect: the parameter n (the length of the vectors), vectors (the number of vectors), and the count of vectors that are closer than the threshold to the target vector. What I don't understand is why the program hangs for an additional minute before terminating.
Here is the output of a run which demonstrates the problem:
$ time lein run 10 100000 1.0
[null] 10 100000 12283
[null] "Elapsed time: 3300.856 msecs"
real 1m6.336s
user 0m7.204s
sys 0m1.495s
Any comments on how to filter in parallel in general are also more than welcome, as I haven't yet confirmed that pfilter actually works.
You need to call shutdown-agents to kill the threads backing the thread pool used by pmap.
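For example, making it the last expression in -main (the bindings are unchanged from your version):
(defn -main [& args]
  (let [[n-str vectors-str threshold-str] args
        n (Integer/parseInt n-str)
        vectors (Integer/parseInt vectors-str)
        threshold (Double/parseDouble threshold-str)
        random-vector (partial random-n-vector n)
        u (random-vector)]
    (time (println n vectors
                   (count
                    (pfilter
                     (fn [v] (< (distance u v) threshold))
                     (take vectors (repeatedly random-vector))))))
    ;; Kill the agent thread pool backing pmap; without this the JVM
    ;; lingers for about a minute while the idle threads time out.
    (shutdown-agents)))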
About pfilter, it should work but run slower than filter, since your predicate is simple. Parallelization isn't free, so you have to give each thread moderately intensive tasks to offset the multithreading overhead. Batch your items before filtering them.
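For example, a minimal sketch of a chunked variant (the chunk size is an arbitrary knob you would want to tune):
(defn pfilter-chunked [pred coll chunk-size]
  ;; Each pmap task filters a whole chunk, so the per-task work is
  ;; large enough to offset the thread-coordination overhead.
  (apply concat
         (pmap (fn [chunk] (doall (filter pred chunk)))
               (partition-all chunk-size coll))))
Note the doall: without it the filtering would be deferred to the consuming thread and the parallelism would buy you nothing.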
I am a newbie to Clojure. I am invoking a Clojure function from Java, and I want to record the time a particular line of Clojure code takes to execute.
Suppose my Clojure function is:
(defn sampleFunction [sampleInput]
  (fun1 (fun2 sampleInput)))
I am invoking the above function from Java; it returns some String value, and I want to record the time it takes to execute fun2.
I have another function, say logTime, which will write the parameter passed to it into some database:
(defn logTime [time]
.....
)
My question is: how can I modify my sampleFunction(..) to invoke logTime to record the time it took to execute fun2?
Thank you in advance.
I'm not entirely sure how the different pieces of your code fit together and interoperate with Java, but here's something that could work with the way you described it.
To get the execution time of a piece of code, there's a core function called time. However, this function doesn't return the execution time, it just prints it... So given that you want to log that time into a database, we need to write a macro to capture both the return value of fun2 as well as the time it took to execute:
(defmacro time-execution
  [& body]
  `(let [s# (new java.io.StringWriter)]
     (binding [*out* s#]
       (hash-map :return (time ~@body)
                 :time (.replaceAll (str s#) "[^0-9\\.]" "")))))
What this macro does is bind standard output to a Java StringWriter, so that we can use it to store whatever the time function prints. To return both the result of fun2 and the time it took to execute, we package the two values in a hash-map (it could be some other collection too; we'll end up destructuring it later). Notice that the code whose execution we're timing is wrapped in a call to time, so that we trigger the printing side effect and capture it in s#. Finally, the .replaceAll is just to ensure that we're only extracting the actual numeric value (in milliseconds), since time prints something of the form "Elapsed time: 0.014617 msecs".
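A quick REPL check of the macro (the exact figure will of course vary from run to run):
user> (time-execution (Thread/sleep 100))
{:return nil, :time "100.586413"}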
Incorporating this into your code, we need to rewrite sampleFunction like so:
(defn sampleFunction [sampleInput]
  (let [{:keys [return time]} (time-execution (fun2 sampleInput))]
    (logTime time)
    (fun1 return)))
We're simply destructuring the hash-map to access both the return value of fun2 and the time it took to execute, then we log the execution time using logTime, and finally we finish by calling fun1 on the return value of fun2.
The library tupelo.profile gives you many options if you want to capture execution time for one or more functions and accumulate it over multiple calls. An example:
(ns tst.demo.core
  (:use tupelo.core tupelo.test)
  (:require
    [tupelo.profile :as prof]))
(defn add2 [x y] (+ x y))
(prof/defnp fast [] (reduce add2 0 (range 10000)))
(prof/defnp slow [] (reduce add2 0 (range 10000000)))
(dotest
  (prof/timer-stats-reset)
  (dotimes [i 10000] (fast))
  (dotimes [i 10] (slow))
  (prof/print-profile-stats))
with result:
--------------------------------------
Clojure 1.10.2-alpha1 Java 14
--------------------------------------
Testing tst.demo.core
---------------------------------------------------------------------------------------------------
Profile Stats:
Samples TOTAL MEAN SIGMA ID
10000 0.955 0.000096 0.000045 :tst.demo.core/fast
10 0.905 0.090500 0.000965 :tst.demo.core/slow
---------------------------------------------------------------------------------------------------
If you want detailed timing for a single method, the Criterium library is what you need. Start off with the quick-bench function.
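A minimal sketch, assuming Criterium is already on your classpath:
(require '[criterium.core :refer [quick-bench]])
;; Warms up the JIT and runs the expression enough times to report
;; a mean execution time with an error estimate.
(quick-bench (reduce + (range 10000)))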
Since the accepted answer has some shortcomings around eating up logs etc., here is perhaps a simpler solution compared to the accepted answer:
(defmacro time-execution [body]
  `(let [st# (System/currentTimeMillis)
         return# ~body
         se# (System/currentTimeMillis)]
     ;; :time is wall-clock seconds as a double
     {:return return#
      :time (double (/ (- se# st#) 1000))}))
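Usage looks the same as before, except :time comes back as seconds in a double rather than a string of milliseconds (again, the exact figure will vary):
user> (time-execution (Thread/sleep 250))
{:return nil, :time 0.25}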
For my Mandelbrot explorer project, I need to run several expensive jobs, ideally in parallel. I decided to try chunking the jobs and running each chunk in its own thread, and ended up with something like
(require '[clojure.core.async :refer [chan thread >!!]])
(defn point-calculator [chunk-size points]
  (let [out-chan (chan (count points))
        chunked (partition chunk-size points)]
    (doseq [chunk chunked]
      (thread
        (let [processed-chunk (expensive-calculation chunk)]
          (>!! out-chan processed-chunk))))
    out-chan))
Where points is a list of [real, imaginary] coordinates to be tested, and expensive-calculation is a function that takes the chunk, and tests each point in the chunk. Each chunk can take a long time to finish (potentially a minute or more depending on the chunk size and the number of jobs).
On my consumer end, I'm using
(loop []
  (when-let [proc-chunk (<!! result-chan)]
    ; Do stuff with chunk
    (recur)))
To consume each processed chunk. Right now, this blocks when the last chunk is consumed since the channel is still open.
I need a way of closing the channel when the jobs are done. This is proving difficult because of the asynchronicity of the producer loop. I can't simply put a close! after the doseq since the loop doesn't block, and I can't just close when the last-indexed job is done, since the order is indeterminate.
The best idea I could come up with was maintaining an (atom #{}) of jobs, and disj-ing each job as it finishes. Then I could either check the set's size in the loop, and close! when it's 0, or attach a watch to the atom and check there.
This seems very hackish though. Is there a more idiomatic way of dealing with this? Does this scenario suggest I'm using async incorrectly?
I would take a look at the take function from core.async. This is what its documentation says:
"Returns a channel that will return, at most, n items from ch. After n items
have been returned, or ch has been closed, the return channel will close.
"
So this leads to a simple fix: instead of returning out-chan, you can just wrap it in take:
(clojure.core.async/take (count chunked) out-chan)
That should work.
Also, I would recommend rewriting your example from the blocking put/take (>!!, <!!) to the parking variants (>!, <!), and from thread to go / go-loop, which is more idiomatic core.async usage.
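Putting the first suggestion into the code from the question (a sketch; only the return value changes):
(require '[clojure.core.async :as async :refer [chan thread >!!]])
(defn point-calculator [chunk-size points]
  (let [chunked (partition chunk-size points)
        out-chan (chan (count points))]
    (doseq [chunk chunked]
      (thread
        (>!! out-chan (expensive-calculation chunk))))
    ;; The returned channel closes itself after one result per chunk,
    ;; so the consumer's when-let loop terminates on its own.
    (async/take (count chunked) out-chan)))
One caveat on the second suggestion: go blocks multiplex over a small fixed thread pool, so if expensive-calculation really blocks for a minute at a time, thread is arguably the safer choice for the producing side.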
You may want to use async/pipeline(-blocking) to control the parallelism, and async/onto-chan to close the input channel automatically after all the chunks are copied.
E.g. the example below shows a 16x improvement in elapsed time when the parallelism is set to 16.
(require '[clojure.core.async :refer [chan go-loop <! <!! pipeline-blocking onto-chan]])
(defn expensive-calculation [pts]
  (Thread/sleep 100)
  (reduce + pts))
(time
 (let [points (take 10000 (repeatedly #(rand 100)))
       chunk-size 500
       inp-chan (chan)
       out-chan (chan)]
   (go-loop []
     (when-let [res (<! out-chan)]
       ;; do stuff with chunk
       (recur)))
   (pipeline-blocking 16 out-chan (map expensive-calculation) inp-chan)
   (<!! (onto-chan inp-chan (partition-all chunk-size points)))))
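Note that onto-chan closes the destination channel by default once all the chunks have been copied, and pipeline-blocking propagates that close to out-chan, so the consuming go-loop terminates on its own with no extra bookkeeping.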
In the same way alt! waits for one of n channels to get a value, I'm looking for the idiomatic way to wait for all n channels to get a value.
I need this because I "spawn" n go blocks to work on async tasks, and I want to know when they are all done. I'm sure there is a very beautiful way to achieve this.
Use the core.async map function:
(<!! (a/map vector [ch1 ch2 ch3]))
;; [val-from-ch-1 val-from-ch2 val-from-ch3]
You can say (mapv #(async/<!! %) channels).
If you wanted to handle individual values as they arrive, and then do something special after the final channel produces a value, you can exploit the fact that alts! / alts!! take a vector of channels, and that they are functions, not macros, so you can easily pass in dynamically constructed vectors.
So, you can use alts!! to wait on your initial collection of n channels, then use it again on the remaining channels etc.
(def c1 (async/chan))
(def c2 (async/chan))
(def out
  (async/thread
    (loop [cs [c1 c2] vs []]
      (let [[v p] (async/alts!! cs)
            cs (filterv #(not= p %) cs)
            vs (conj vs v)]
        (if (seq cs)
          (recur cs vs)
          vs)))))
(async/>!! c1 :foo)
(async/>!! c2 :bar)
(async/<!! out)
;= [:foo :bar]
If instead you wanted to take all values from all the input channels and then do something else when they all close, you'd want to use async/merge:
clojure.core.async/merge
([chs] [chs buf-or-n])
Takes a collection of source channels and returns a channel which
contains all values taken from them. The returned channel will be
unbuffered by default, or a buf-or-n can be supplied. The channel
will close after all the source channels have closed.
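For example, a small sketch that gathers everything into one vector once all sources are done (c1, c2 and c3 stand for any channels that will eventually close):
(let [merged (async/merge [c1 c2 c3])]
  ;; async/into delivers a single collection, but only after the
  ;; merged channel closes, i.e. after all sources have closed.
  (async/<!! (async/into [] merged)))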
Say I have a function, (get-events "feed"), that returns a vector of events in chronological order, taken from an external source.
Now, at any given moment, that function returns a list of events up to that point in time. Called a few seconds later, it will return a few more events, etc, as the feed continually grows.
If I want to create a lazy-seq that forever pulls new events from the feed, making sure it doesn't repeat those that have already been seen, how would I write this? I'm running into a stack overflow error when I don't use recur, but I can't use recur, because it doesn't appear in a tail position.
(defn continually-list-events
  ([feed] (continually-list-events feed (hash-set)))
  ([feed seen]
   (let [events-now (get-events feed)]
     (into (remove seen events-now)
           (lazy-seq
            (continually-list-events feed
                                     (into seen events-now)))))))
You can see I'm trying to use an accumulator to track events already seen (in a set), and I'm making sure to always filter out the ones I've seen.
If each step keeps track of how many events have been received so far, then that iteration can return a sequence of new events by dropping the old ones.
user> (->> (iterate (fn [[events-so-far _]]
                      (let [events (get-events)
                            new-events (drop events-so-far events)]
                        [(count events) new-events]))
                    [0 nil])
           (mapcat second))
Then you can drop the counts from the sequence and flatten the chunks of events into a sequence of single events.
In your example, the stack overflow is because there is no call to cons after the call to lazy-seq, so it's calculating the whole list just to produce the first item of the sequence.
user> (defn example [x] (lazy-seq (cons x (example (inc x)))))
#'user/example
user> (take 5 (example 4))
(4 5 6 7 8)
user> (defn example [x] (lazy-seq (example (inc x))))
#'user/example
user> (take 5 (example 4))
... long pause then out of memory ...
PS: using lazy-seq directly is somewhat uncommon, though it's important to know how it works.
I've defined a simple factorial function in the REPL:
(defn factorial [n]
  (loop [current n
         fact 1]
    (if (= current 1)
      fact
      (recur (dec current) (* current fact)))))
The function works fine. But when I try to call the function multiple times with a dotimes loop, the REPL seems to stop working. I don't get any results back for whatever expression I type, and I have to restart the REPL.
I loop with:
(dotimes [x 10]
  (println "Factorial of " x " is " (factorial x)))
I'm using IntelliJ with the La Clojure plugin (Clojure version 1.3.0).
I bet it takes an awfully long time to compute (factorial 0) with that function definition...
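dotimes starts x at 0, and with current bound to 0 the loop never reaches (= current 1): it just keeps decrementing through the negative numbers, which is why the REPL appears to hang. A sketch of the usual guard:
(defn factorial [n]
  (loop [current n
         fact 1]
    (if (<= current 1) ; also terminates for 0 and negative inputs
      fact
      (recur (dec current) (* current fact)))))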