In Clojure how can you run a method like pmap and call a function when all items in the collection have been processed?
You can wrap it in a function that forces the mapping and then calls a callback:
user> (defn pmap-callback [callback f & colls]
        (let [res (doall (apply pmap f colls))]
          (callback res)
          res))
#'user/pmap-callback
user> (pmap-callback #(println "complete!" %)
                     + [1 2 3] [4 5 6])
;;=> complete! (5 7 9)
(5 7 9)
From pmap doc string:
Like map, except f is applied in parallel. Semi-lazy in that the
parallel computation stays ahead of the consumption, but doesn't
realize the entire result unless required. Only useful for
computationally intensive functions where the time of f dominates the
coordination overhead.
So the pmap result won't be computed completely until you ask for all the elements of the result sequence. As with other lazy sequences, you can force the sequence to be fully realized using doall (if you need to retain the whole evaluated sequence) or dorun (if you are interested only in the side effects of your mapping function and don't need the evaluated sequence).
To execute a function when all the results have been computed you can just call it directly after the call to dorun or doall.
It's also worth noting that pmap is not a tool for scheduling asynchronous parallel jobs where you get notified when the processing finishes. For such use cases Clojure's core.async might be a better choice.
The Clojure reference contains the following comments about transducers, which seem to say something important about the safety of writing and using transducers:
If you have a new context for applying transducers, there are a few general rules to be aware of:
If a step function returns a reduced value, the transducible process must not supply any more inputs to the step function. The
reduced value must be unwrapped with deref before completion.
A completing process must call the completion operation on the final accumulated value exactly once.
A transducing process must encapsulate references to the function returned by invoking a transducer - these may be stateful and unsafe
for use across threads.
Can you explain, possibly with some examples, what each of these cases means? Also, what does "context" refer to in this context?
Thanks!
If a step function returns a reduced value, the transducible process must not supply any more inputs to the step function. The reduced value must be unwrapped with deref before completion.
One example of this scenario is take-while transducer:
(fn [rf]
  (fn
    ([] (rf))
    ([result] (rf result))
    ([result input]
     (if (pred input)
       (rf result input)
       (reduced result)))))
As you can see, it can return a reduced value, which means there is no point in providing more input to such a step function (and it would actually be an error to do so) - we already know no more values can be produced.
For example, while processing the input collection (1 1 3 5 6 8 7) with the odd? predicate, once we reach the value 6 no more values will be returned by the step function created by the (take-while odd?) transducer.
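You can see this with transduce (a transducible process from clojure.core), using the collection from the example above:

```clojure
;; take-while's step function returns (reduced result) at the first even value,
;; so the transducible process must stop supplying inputs there:
;; 8 and 7 are never passed to the step function.
(transduce (take-while odd?) conj [] [1 1 3 5 6 8 7])
;; => [1 1 3 5]
```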
A completing process must call the completion operation on the final accumulated value exactly once.
This is a scenario where a transducer returns a stateful step function. A good example is the partition-by transducer. For instance, when (partition-by odd?) is used by a transducible process for processing (1 3 2 4 5 6 8), it will produce ((1 3) (2 4) (5) (6 8)).
(fn [rf]
  (let [a (java.util.ArrayList.)
        pv (volatile! ::none)]
    (fn
      ([] (rf))
      ([result]
       (let [result (if (.isEmpty a)
                      result
                      (let [v (vec (.toArray a))]
                        ;;clear first!
                        (.clear a)
                        (unreduced (rf result v))))]
         (rf result)))
      ([result input]
       (let [pval @pv
             val (f input)]
         (vreset! pv val)
         (if (or (identical? pval ::none)
                 (= val pval))
           (do
             (.add a input)
             result)
           (let [v (vec (.toArray a))]
             (.clear a)
             (let [ret (rf result v)]
               (when-not (reduced? ret)
                 (.add a input))
               ret))))))))
If you take a look at the implementation you will notice that the step function won't return its accumulated values (stored in an array list) until the predicate function returns a different result (e.g. after a sequence of odd numbers it receives an even number, and returns the seq of accumulated odd numbers). The issue is that when we reach the end of the source data, there is no chance to observe a change in the predicate's result, so the accumulated values would never be returned. Thus the transducible process must call the completion operation of the step function (the single-arity version) so it can return its accumulated result (in our case (6 8)).
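The completion step can be observed with into (another transducible process from clojure.core), which calls the completion arity exactly once after the input is exhausted:

```clojure
;; Without the completion call, the final buffered partition [6 8] would be
;; lost; `into` calls the completion arity once at the end of the input.
(into [] (partition-by odd?) [1 3 2 4 5 6 8])
;; => [[1 3] [2 4] [5] [6 8]]
```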
A transducing process must encapsulate references to the function returned by invoking a transducer - these may be stateful and unsafe for use across threads.
When a transducible process is executed by passing it source data and a transducer instance, it will first call the transducer function to produce a step function. A transducer is a function of the following shape:
(fn [xf]
  (fn
    ([] ...)
    ([result] ...)
    ([result input] ...)))
Thus the transducible process calls this top-level function (accepting xf - a reducing function) to obtain the actual step function used for processing the data elements. The point is that the transducible process must keep a reference to that step function and use the same instance for processing all elements from a particular data source (e.g. the step function instance produced by the partition-by transducer must be used for processing the whole input sequence, as it keeps internal state, as you saw above). Using different instances to process a single data source would yield incorrect results.
Similarly, a transducible process cannot reuse a step function instance for processing multiple data sources, for the same reason - the step function instance might be stateful and keep internal state for a particular data source. That state would be corrupted if the step function were used to process another data source.
Also, there is no guarantee that a step function implementation is thread-safe.
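Here is a contrived sketch of the kind of misuse this rule forbids - creating one step-function instance and reusing it for two separate reductions (note that plain reduce never calls the completion arity, so buffered state survives between the calls):

```clojure
;; ONE step-function instance from partition-by, shared (incorrectly) across
;; two sources. Its internal ArrayList carries state from the first reduction
;; into the second.
(def step ((partition-by odd?) conj))

(def r1 (reduce step [] [1 2]))  ;; => [[1]]      (2 is still buffered in `step`)
(def r2 (reduce step [] [3 4]))  ;; => [[2] [3]]  (the leftover 2 leaks into this result)
```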
What does "context" refer to in this context?
"A new context for applying transducers" means implementing a new type of a transducible process. Clojure provides transducible processes working with collections (e.g. into, sequence). core.async library chan function (one of its arities) accepts a transducer instance as an argument which produces an asynchronous transducible process producing values (that can be consumed from the channel) by applying a transducer to consumed values.
You could for example create a transducible process for handling data received on a socket, or your own implementation of observables.
They could use transducers for transforming the data, as transducers are agnostic about where the data comes from (a socket, a stream, a collection, an event source etc.) - a transducer's step function is just called with individual elements.
They also don't care (and don't know) what should be done with the results they generate (e.g. should each result be appended to an output sequence (say, with conj)? sent over the network? inserted into a database?) - that is abstracted away by the reducing function captured by the step function (the rf argument above).
So instead of creating a step function that just uses conj or saves elements to db, we pass a function which has a specific implementation of that operation. And your transducible process defines what that operation is.
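As a sketch, here is a toy transducible process (a hypothetical my-transduce, written only for illustration) that follows all three rules from the reference:

```clojure
(defn my-transduce
  "Toy transducible process: applies xform to rf, folds coll, and observes
  the three rules from the Clojure reference."
  [xform rf init coll]
  (let [f   (xform rf)                  ; rule 3: capture ONE step-fn instance
        acc (loop [acc init, xs (seq coll)]
              (if (and xs (not (reduced? acc)))
                (recur (f acc (first xs)) (next xs)) ; rule 1: stop after reduced
                acc))
        acc (if (reduced? acc) @acc acc)]            ; rule 1: unwrap with deref
    (f acc)))                                        ; rule 2: complete exactly once

(my-transduce (partition-by odd?) conj [] [1 3 2 4 5 6 8])
;; => [[1 3] [2 4] [5] [6 8]]
(my-transduce (take-while odd?) conj [] [1 1 3 5 6 8 7])
;; => [1 1 3 5]
```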
I have a function that I'd like to run multiple times, generating a list of the results:
(take 10 (repeatedly #(myfunc)))
I realized I could run them in parallel with pmap:
(pmap (fn [_] (myfunc)) (range 10))
But it is a bit untidy. Is there a standard function that lets me do this? Something like:
(prun 10 #(myfunc))
?
You may also be interested in the Claypoole library for managing thread pools and parallel processing. Look at its versions of pmap and pfor.
I don't think there's an existing function, but using pcalls rather than pmap seems a little closer to what you want:
(defn prun [n f]
  (apply pcalls (repeat n f)))
By the way, you don't need to wrap myfunc with #() in the call to repeatedly, nor when calling prun as defined above:
(prun 10 myfunc)
You may find pvalues useful as well.
You can use dotimes
(dotimes [_ 10] (myfunc))
This will run your function 10 times. Note that dotimes is for side effects only: it returns nil, so it won't give you a list of results. Be sure to run this in the same namespace as your function.
If I perform a side-effecting/mutating operation on individual data structures specific to each member of lazy sequence using map, do I need to (a) call doall first, to force realization of the original sequence before performing the imperative operations, or (b) call doall to force the side-effects to occur before I map a functional operation over the resulting sequence?
I believe that no doalls are necessary when there are no dependencies between elements of any sequence, since map can't apply a function to a member of a sequence until the functions from maps that produced that sequence have been applied to the corresponding element of the earlier sequence. Thus, for each element, the functions will be applied in the proper sequence, even though one of the functions produces side effects that a later function depends on. (I know that I can't assume that any element a will have been modified before element b is, but that doesn't matter.)
Is this correct?
That's the question, and if it's sufficiently clear, then there's no need to read further. The rest describes what I'm trying to do in more detail.
My application has a sequence of defrecord structures ("agents") each of which contains some core.matrix vectors (vec1, vec2) and a core.matrix matrix (mat). Suppose that for the sake of speed, I decide to (destructively, not functionally) modify the matrix.
The program performs the following three steps to each of the agents by calling map, three times, to apply each step to each agent.
Update a vector vec1 in each agent, functionally, using assoc.
Modify a matrix mat in each agent based on the preceding vector (i.e. the matrix will retain a different state).
Update a vector vec2 in each agent using assoc based on the state of the matrix produced by step 2.
For example, where persons is a sequence, possibly lazy (EDIT: Added outer doalls):
(doall
 (->> persons
      (map #(assoc % :vec1 (calc-vec1 %)))            ; update vec1 from person
      (map update-mat-from-vec1!)                     ; modify mat based on state of vec1
      (map #(assoc % :vec2 (calc-vec2-from-mat %))))) ; update vec2 based on state of mat
Alternatively:
(doall
 (map #(assoc % :vec2 (calc-vec2-from-mat %))              ; update vec2 based on state of mat
      (map update-mat-from-vec1!                           ; modify mat based on state of vec1
           (map #(assoc % :vec1 (calc-vec1 %)) persons)))) ; update vec1 from person
Note that no agent's state depends on the state of any other agent at any point. Do I need to add doalls?
EDIT: Overview of answers as of 4/16/2014:
I recommend reading all of the answers given, but it may seem as if they conflict. They don't, and I thought it might be useful if I summarized the main ideas:
(1) The answer to my question is "Yes": If, at the end of the process I described, one causes the entire lazy sequence to be realized, then what is done to each element will occur according to the correct sequence of steps (1, 2, 3). There is no need to apply doall before or after step 2, in which each element's data structure is mutated.
(2) But: This is a very bad idea; you are asking for trouble in the future. If at some point you inadvertently end up realizing all or part of the sequence at a time other than what you originally intended, the later steps could get values from the data structure that were put there at the wrong time - at a time other than what you expect. The step that mutates a per-element data structure won't happen until a given element of the lazy seq is realized, so if you realize it at the wrong time, you could get the wrong data in later steps. This could be the kind of bug that is very difficult to track down. (Thanks to @A.Webb for making this problem very clear.)
Use extreme caution mixing laziness with side effects
(defrecord Foo [fizz bang])
(def foos (map ->Foo (repeat 5 0) (map atom (repeat 5 1))))
(def foobars (map #(assoc % :fizz @(:bang %)) foos))
So will my fizz of foobars now be 1?
(:fizz (first foobars)) ;=> 1
Cool, now I'll leave foobars alone and work with my original foos...
(doseq [foo foos] (swap! (:bang foo) (constantly 42)))
Let's check on foobars
(:fizz (first foobars)) ;=> 1
(:fizz (second foobars)) ;=> 42
Whoops...
Generally, use doseq instead of map for your side effects or be aware of the consequences of delaying your side effects until realization.
You do not need to add any calls to doall, provided you do something with the results later in your program. If you ran the maps above and did nothing with the result, none of the elements would be realized. On the other hand, if you read through the resulting sequence, to print it for instance, then each of your computations will happen in order on each element. That is, steps 1, 2, and 3 will happen to the first item in the input sequence, then steps 1, 2, and 3 will happen to the second, and so forth. There is no need to pre-realize sequences to ensure the values are available; lazy evaluation will take care of that.
You don't need to add doall between two map operations. But unless you're working in a REPL, you do need to add doall or dorun to force the execution of your lazy sequence.
This is true, unless you care about the order of operations.
Let's consider the following example:
(defn f1 [x]
  (print "1>" x ", ")
  x)

(defn f2 [x]
  (print "2>" x ", ")
  x)

(defn foo [mycoll]
  (->> mycoll
       (map f1)
       (map f2)
       dorun))
By default Clojure will take the first chunk of mycoll and apply f1 to all elements of that chunk, then apply f2 to the resulting chunk.
So, if mycoll is a list or an ordinary (unchunked) lazy sequence, you'll see that f1 and f2 are applied to each element in turn:
=> (foo (list \a \b))
1> a , 2> a , 1> b , 2> b , nil
or
=> (->> (iterate inc 7) (take 2) foo)
1> 7 , 2> 7 , 1> 8 , 2> 8 , nil
But if mycoll is a vector or chunked lazy sequence, you'll see quite a different thing:
=> (foo [\a \b])
1> a , 1> b , 2> a , 2> b , nil
Try
=> (foo (range 50))
and you'll see that it processes elements in chunks by 32 elements.
So, be careful using lazy calculations with side effects!
Here are some hints for you:
Always end your call chain with doall or dorun to force the calculation.
Use doall and comp to control the order of calculations, e.g.:
(->> [\a \b]
     ; apply both f1 and f2 before moving to the next element
     (map (comp f2 f1))
     dorun)
(->> (list \a \b)
     (map f1)
     ; process the whole sequence before applying f2
     doall
     (map f2)
     dorun)
map always produces a lazy result, even for a non-lazy input. You should call doall (or dorun if the sequence will never be used and the mapping is only done for side effects) on the output of map if you need to force some imperative side effect (for example use a file handle or db connection before it is closed).
user> (do (map println [0 1 2 3]) nil)
nil
user> (do (doall (map println [0 1 2 3])) nil)
0
1
2
3
nil
In trying to replicate some websockets examples I've run into some behavior I don't understand and can't seem to find documentation for. Simplified, here's an example I'm running in lein that's supposed to run a function for every element in a shared map once per second:
(def clients (atom {"a" "b" "c" "d" }))
(def ticker-agent (agent nil))
(defn execute [a]
  (println "execute")
  (let [keys (keys @clients)]
    (println "keys= " keys)
    (doseq [x keys] (println x)))
  ;(map (fn [k] (println k)) keys)) ;; replace doseq with this?
  (Thread/sleep 1000)
  (send *agent* execute))
(defn -main [& args]
  (send ticker-agent execute))
If I run this with map I get
execute
keys= (a c)
execute
keys= (a c)
...
First confusing issue: I understand that I'm likely using map incorrectly because there's no return value, but does that mean the inner println is optimized away? Especially given that if I run this in a repl:
(map #(println %) '(1 2 3))
it works fine?
Second question - if I run this with doseq instead of map I can run into conditions where the execution agent stops (which I'd append here, but am having difficulty isolating/recreating). Clearly there's something I'm missing, possibly relating to locking on the map's keyset? I was able to do this even after moving the shared map out of an atom. Is there default synchronization on the Clojure map?
map is lazy. This means that it does not calculate any result until that result is accessed from the data structure it returns, so if its result is never used, it will not run anything at all.
When you use map from the REPL, the print stage of the REPL accesses the data, which causes any side effects in your mapped function to be invoked. Inside a function, if the return value is never looked at, any side effects in the mapping function will not occur.
You can use doall to force full evaluation of a lazy sequence. You can use dorun if you don't need the result value but want to ensure all side effects are invoked. Also you can use mapv which is not lazy (because vectors are never lazy), and gives you an associative data structure, which is often useful (better random access performance, optimized for appending rather than prepending).
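The difference between map and mapv is easy to demonstrate with an atom as the side effect (a small sketch; note this behaves differently at the REPL, where printing the result realizes it):

```clojure
(def seen (atom []))

(map #(swap! seen conj %) [1 2 3])   ; lazy, result discarded: never realized
(def after-map @seen)                ;; => [] - no side effects have run

(mapv #(swap! seen conj %) [1 2 3])  ; eager: side effects run immediately
(def after-mapv @seen)               ;; => [1 2 3]
```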
Edit: Regarding the second part of your question (moving this here from a comment).
No, there is nothing about doseq that would hang your execution. Try checking the agent-error status of your agent to see if there is some exception, because agents stop executing and stop accepting new tasks by default if they hit an error condition. You can also use set-error-mode! and set-error-handler! to customize the agent's error handling behavior.
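For example (a small sketch; the sleep is just a crude way to give the agent's thread pool time to run the action):

```clojure
(def failing-agent (agent 0))

;; This action throws, which puts the agent into the failed state.
(send failing-agent (fn [_] (throw (ex-info "boom" {}))))
(Thread/sleep 200)                      ; crude: wait for the action to run

(def err (agent-error failing-agent))   ; non-nil once the agent has failed
(restart-agent failing-agent 0)         ; clear the failure so sends work again
```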
Why does this bit of Clojure code:
user=> (map (constantly (println "Loop it.")) (range 0 3))
Yield this output:
Loop it.
(nil nil nil)
I'd expect it to print "Loop it" three times as a side effect of evaluating the function three times.
constantly doesn't evaluate its argument multiple times. It's a function, not a macro, so the argument is evaluated exactly once, before constantly runs. All constantly does is take its (evaluated) argument and return a function that returns that value every time it's called - without re-evaluating anything since, as I said, the argument was already evaluated before constantly even ran.
If all you want to do is to call (println "Loop it") for every element in the range, you should pass that in as the function to map instead of constantly. Note that you'll actually have to pass it in as a function, not an evaluated expression.
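You can confirm the single evaluation with a counter (a small sketch using an atom):

```clojure
(def evals (atom 0))

;; The (do ...) argument runs exactly once, before constantly is called.
(def f (constantly (do (swap! evals inc) :value)))

(doall (map f (range 3)))  ; calling f three times re-evaluates nothing
@evals                     ;; => 1
(f :anything)              ;; => :value, every time
```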
As sepp2k rightly points out constantly is a function, so its argument will only be evaluated once.
The idiomatic way to achieve what you are doing here would be to use doseq:
(doseq [i (range 0 3)]
  (println "Loop it."))
Or alternatively dotimes (which is a little more concise and efficient in this particular case as you aren't actually using the sequence produced by range):
(dotimes [i 3]
  (println "Loop it."))
Both of these solutions are non-lazy, which is probably what you want if you are just running some code for the side effects.
You can get behavior close to your intent by using repeatedly and a lambda expression.
For instance:
(repeatedly 3 #(println "Loop it"))
Unless you're at the REPL, this needs to be surrounded by a dorun or similar. repeatedly is lazy.