Understanding Clojure transducers pitfalls

The Clojure reference contains the following comments about transducers, which seem to say something important about the safety of writing and using transducers:
If you have a new context for applying transducers, there are a few general rules to be aware of:
If a step function returns a reduced value, the transducible process must not supply any more inputs to the step function. The
reduced value must be unwrapped with deref before completion.
A completing process must call the completion operation on the final accumulated value exactly once.
A transducing process must encapsulate references to the function returned by invoking a transducer - these may be stateful and unsafe
for use across threads.
Can you explain, possibly with some examples, what each of these cases means? Also, what does "context" refer to in this context?
Thanks!

If a step function returns a reduced value, the transducible process must not supply any more inputs to the step function. The reduced value must be unwrapped with deref before completion.
One example of this scenario is the take-while transducer:
(fn [rf]
  (fn
    ([] (rf))
    ([result] (rf result))
    ([result input]
     (if (pred input)
       (rf result input)
       (reduced result)))))
As you can see, it can return a reduced value, which means there is no point (and it would actually be an error) in providing more input to such a step function - we already know no more values can be produced.
For example, while processing the input collection (1 1 3 5 6 8 7) with the odd? predicate, once we reach the value 6 no more values will be produced by the step function created by the (take-while odd?) transducer.
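You can see this rule in action with transduce, which stops supplying inputs as soon as the step function returns a reduced value:

(transduce (take-while odd?) conj [] [1 1 3 5 6 8 7])
;; => [1 1 3 5] - the inputs 8 and 7 are never supplied to the step function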
A completing process must call the completion operation on the final accumulated value exactly once.
This is a scenario where a transducer returns a stateful step function. A good example is the partition-by transducer. For example, when (partition-by odd?) is used by the transducible process for processing (1 3 2 4 5 6 8), it will produce ((1 3) (2 4) (5) (6 8)).
(fn [rf]
  (let [a (java.util.ArrayList.)
        pv (volatile! ::none)]
    (fn
      ([] (rf))
      ([result]
       (let [result (if (.isEmpty a)
                      result
                      (let [v (vec (.toArray a))]
                        ;; clear first!
                        (.clear a)
                        (unreduced (rf result v))))]
         (rf result)))
      ([result input]
       (let [pval @pv
             val (f input)]
         (vreset! pv val)
         (if (or (identical? pval ::none)
                 (= val pval))
           (do
             (.add a input)
             result)
           (let [v (vec (.toArray a))]
             (.clear a)
             (let [ret (rf result v)]
               (when-not (reduced? ret)
                 (.add a input))
               ret))))))))
If you take a look at the implementation, you will notice that the step function won't return its accumulated values (stored in an array list) until the predicate function returns a different result (e.g. after a sequence of odd numbers, receiving an even number makes it return a seq of the accumulated odd numbers). The issue is that if we reach the end of the source data, there is no chance to observe a change in the predicate's result, so the accumulated values would never be returned. Thus the transducible process must call the completion operation of the step function (the arity-1 body) exactly once, so that it can flush its accumulated result (in our case (6 8)).
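You can see the completion step at work with into, which calls the completion operation after the last input has been consumed:

(into [] (partition-by odd?) [1 3 2 4 5 6 8])
;; => [[1 3] [2 4] [5] [6 8]] - the final [6 8] is only emitted by the completion call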
A transducing process must encapsulate references to the function returned by invoking a transducer - these may be stateful and unsafe for use across threads.
When a transducible process is executed by passing it source data and a transducer instance, it will first call the transducer function to produce a step function. A transducer is a function of the following shape:
(fn [xf]
  (fn
    ([] ...)
    ([result] ...)
    ([result input] ...)))
Thus the transducible process calls this top-level function (which accepts xf, a reducing function) to obtain the actual step function used for processing the data elements. The issue is that the transducible process must keep the reference to that step function and use the same instance for processing all elements from a particular data source (e.g. the step function instance produced by the partition-by transducer must be used for processing the whole input sequence, as it keeps internal state, as you saw above). Using different instances for processing a single data source would yield incorrect results.
Similarly, a transducible process cannot reuse a step function instance for processing multiple data sources, for the same reason - the step function instance might be stateful and keep internal state for a particular data source. That state would be corrupted if the step function were used for processing another data source.
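Here is a minimal sketch of that failure mode, using the stateful step function produced by the (take 2) transducer:

(def step ((take 2) conj))

(step (reduce step [] [1 2 3]))
;; => [1 2] - the first use is correct
(step (reduce step [] [1 2 3]))
;; => [] - the internal counter was already exhausted by the first run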
Also, there is no guarantee that a given step function implementation is thread safe.
What does "context" refer to in this context?
"A new context for applying transducers" means implementing a new type of a transducible process. Clojure provides transducible processes working with collections (e.g. into, sequence). core.async library chan function (one of its arities) accepts a transducer instance as an argument which produces an asynchronous transducible process producing values (that can be consumed from the channel) by applying a transducer to consumed values.
You could for example create a transducible process for handling data received on a socket, or your own implementation of observables.
They could use transducers for transforming the data, as transducers are agnostic about where the data comes from (a socket, a stream, a collection, an event source, etc.) - the step function is just called with individual elements.
They also don't care (and don't know) what should be done with the results they generate (e.g. should they be appended to a result sequence (for example with conj)? sent over the network? inserted into a database?) - that is abstracted away by the reducing function captured by the step function (the rf argument above).
So instead of creating a step function that hardcodes conj or saves elements to a database, we pass in a function that implements that specific operation, and your transducible process defines what that operation is.
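Putting the three rules together, here is a minimal sketch of a custom transducible process - essentially a stripped-down transduce. Note that clojure.core/reduce already stops supplying inputs and unwraps the value when the step function returns a reduced result:

(defn my-transduce [xform rf init coll]
  (let [step (xform rf)]               ; create the step fn once, keep the reference local
    (step (reduce step init coll))))   ; call the completion operation exactly once

(my-transduce (partition-by odd?) conj [] [1 3 2 4 5 6 8])
;; => [[1 3] [2 4] [5] [6 8]]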

Related

How do you call function once pmap has completed in Clojure?

In Clojure how can you run a method like pmap and call a function when all items in the collection have been processed?
You can wrap it in a function that forces the mapping and calls a callback function afterwards:
user> (defn pmap-callback [callback f & colls]
        (let [res (doall (apply pmap f colls))]
          (callback res)
          res))
#'user/pmap-callback
user> (pmap-callback #(println "complete!" %)
                     + [1 2 3] [4 5 6])
complete! (5 7 9)
(5 7 9)
From pmap doc string:
Like map, except f is applied in parallel. Semi-lazy in that the
parallel computation stays ahead of the consumption, but doesn't
realize the entire result unless required. Only useful for
computationally intensive functions where the time of f dominates the
coordination overhead.
So the pmap result won't be computed completely until you ask for all the elements of the result sequence. As with other lazy sequences, you can force full realization using doall (if you need to retain the whole evaluated sequence) or dorun (if you are interested only in the side effects done by your mapping function and don't need the evaluated sequence).
To execute a function when all the results have been computed you can just call it directly after the call to dorun or doall.
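For example, when only the side effects matter (save-result! and items are hypothetical placeholders here):

(dorun (pmap save-result! items))  ; forces all parallel calls, returns nil
(println "all items processed")    ; runs only after every call has completed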
It's also worth noting that pmap is not a tool for scheduling asynchronous parallel jobs where you get notified when processing finishes. For such use cases Clojure core.async might be a better choice.

What are side-effects in predicates and why are they bad?

I'm wondering what is considered to be a side-effect in predicates for fns like remove or filter. There seems to be a range of possibilities. Clearly, if the predicate writes to a file, this is a side-effect. But consider a situation like this:
(def *big-var-that-might-be-garbage-collected* ...)

(let [my-ref *big-var-that-might-be-garbage-collected*]
  (defn my-pred
    [x]
    (some-operation-on my-ref x)))
Even if some-operation-on is merely a query that does not change state, the fact that my-pred retains a reference to *big... changes the state of the system in that the big var cannot be garbage collected. Is this also considered to be side-effect?
In my case, I'd like to write to a logging system in a predicate. Is this a side effect?
And why are side-effects in predicates discouraged exactly? Is it because filter and remove and their friends work lazily so that you cannot determine when the predicates are called (and - hence - when the side-effects happen)?
GC is not typically considered when evaluating if a function is pure or not, although many actions that make a function impure can have a GC effect.
Logging is a side effect, as is changing any state in the program or the world. A pure function takes data and returns data, without modifying anything else.
https://softwareengineering.stackexchange.com/questions/15269/why-are-side-effects-considered-evil-in-functional-programming covers why side effects are avoided in functional languages; I found that link helpful.
The problem is determining when, or even whether, the side-effects will occur on any given call to the function.
If you only care that the same inputs return the same answer, you are fine. Side-effects are dependent on how the function is executed.
For example,
(first (filter odd? (range 20)))
; 1
But if we arrange for odd? to print its argument as it goes:
(first (filter #(do (print %) (odd? %)) (range 20)))
It will print 012345678910111213141516171819 before returning 1!
The reason is that filter, where it can, deals with its sequence argument in chunks of 32 elements.
If we take the limit off the range:
(first (filter #(do (print %) (odd? %)) (range)))
... we get a full-size chunk printed: 012345678910111213141516171819202122232425262728293031
Just printing the argument is confusing. If the side effects are significant, things could go seriously awry.
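If the timing of the side effects matters, one option (a sketch, not the only approach) is to avoid lazy chunking entirely:

;; eager: filterv realizes the whole collection up front, so every
;; side effect happens before first is called
(first (filterv #(do (print %) (odd? %)) (range 20)))

;; early-terminating: reduce with reduced stops at the first match
(reduce (fn [_ x]
          (print x)
          (when (odd? x) (reduced x)))
        nil
        (range 20))
;; prints 01 and returns 1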

Clojure confusion - behavior of map, doseq in a multiprocess environment

In trying to replicate some websockets examples I've run into some behavior I don't understand and can't seem to find documentation for. Simplified, here's an example I'm running in lein that's supposed to run a function for every element in a shared map once per second:
(def clients (atom {"a" "b" "c" "d"}))
(def ticker-agent (agent nil))

(defn execute [a]
  (println "execute")
  (let [keys (keys @clients)]
    (println "keys= " keys)
    (doseq [x keys] (println x)))
    ;(map (fn [k] (println k)) keys)) ;; replace doseq with this?
  (Thread/sleep 1000)
  (send *agent* execute))

(defn -main [& args]
  (send ticker-agent execute))
If I run this with map I get
execute
keys= (a c)
execute
keys= (a c)
...
First confusing issue: I understand that I'm likely using map incorrectly because there's no return value, but does that mean the inner println is optimized away? Especially given that if I run this in a repl:
(map #(println %) '(1 2 3))
it works fine?
Second question - if I run this with doseq instead of map I can run into conditions where the execution agent stops (which I'd append here, but am having difficulty isolating/recreating). Clearly there's something I'm missing, possibly relating to locking on the map's keyset? I was able to do this even after moving the shared map out of an atom. Is there default synchronization on the Clojure map?
map is lazy. This means that it does not calculate any result until the result is accessed from the data structure it returns. In other words, it will not run anything if its result is not used.
When you use map from the REPL, the print stage of the REPL accesses the data, which causes any side effects in your mapped function to be invoked. Inside a function, if the return value is not used, any side effects in the mapping function will not occur.
You can use doall to force full evaluation of a lazy sequence, or dorun if you don't need the result value but want to ensure all side effects are invoked. You can also use mapv, which is not lazy (because vectors are never lazy) and gives you an associative data structure, which is often useful (better random-access performance, optimized for appending rather than prepending).
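For example, inside the let in the execute function above, either of these forms would make the printing happen:

(dorun (map (fn [k] (println k)) keys))  ; forces the lazy seq, returns nil
(mapv (fn [k] (println k)) keys)         ; eager; returns a vector of nils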
Edit: Regarding the second part of your question (moving this here from a comment).
No, there is nothing about doseq that would hang your execution. Try checking the agent-error status of your agent to see if there is some exception, because agents stop executing and stop accepting new tasks by default if they hit an error condition. You can also use set-error-mode! and set-error-handler! to customize the agent's error handling behavior.

Which clojure parallelism technique to use when searching a growing solution space?

What is the correct way, in Clojure, to do parallel processing when each job of the processing can occur in utter isolation and may generate a list of additional jobs that need to be evaluated?
My actual problem is a nutritional calculation problem, but I will put this in the form of Chess which shares the same problem space traits as my calculation.
Assume, for instance, that I am trying to find all of the moves to Checkmate in a game of Chess. When searching through the board states, I would start out with 20 possible states, each representing a different possible opening move. Each of those will need to be evaluated, accepted or rejected, and then for each accepted move, a new list of jobs would be created representing all of the possible next moves. The jobs would look like this:
initial:  '([] proposed-move)
accepted: '([move] proposed-response)
          '([move move] proposed-response)
The number of states to evaluate grows as a result of each computation, and each state can be evaluated in complete isolation from all of the others.
A solution I am playing with goes as such:
; a list of all final solutions, each of which is a sequence of moves
(def solutions (agent []))
; a list of all jobs pending evaluation
(def jobs (agent []))
Given these definitions, I would have a Java thread pool, and each thread would request a job from the jobs agent (and wait for that request to be fulfilled). It would then run the calculation, generating a list of solutions and a list of new jobs. Finally, it would send the solutions to the solutions agent, and the new jobs to the jobs agent.
Is using a combination of agents and threads the most idiomatic way to go in this case? Can I even get data out of the job queue in the way I am proposing?
Or should my jobs be a java.util.concurrent.LinkedBlockingQueue, as described in Producer consumer with qualifications?
You can do this with the following approach:
- Repeatedly apply pmap (which processes all elements of a collection in parallel)
- The function used in pmap can return a list of zero, one, or multiple new elements, which will then be processed in the next iteration
- The results get recombined with concat
- You repeat the processing of the list as many times as needed, perhaps storing the result in an atom
Example code could be something like the following:

(def jobs (atom '(1 10 100)))

(defn process-element [value]
  (if (< (rand) 0.8)
    [(inc value)]
    []))

(defn do-processing []
  (swap! jobs
         (fn [job-list] (apply concat (pmap process-element job-list)))))

(while (seq @jobs)
  (prn @jobs)
  (do-processing))
Which could produce output like:
(1 10 100)
(2 11 101)
(3 12 102)
(4 13 103)
(5 14 104)
(6 15 105)
(7 106)
(107)
(108)
(109)
nil
Note that you need to be a bit careful to make sure your algorithm terminates! In the example this is guaranteed by the elements dying off over time, but if your search space is growing then you will probably want to apply a time limit instead of just using a (while ...) loop.
Your approach with agents and threads seems quite close to (what I see as) idiomatic Clojure.
The only thing I would change to make it more "Clojure like" would be to use pmap to iterate over the queue stored in an agent. Using pmap instead of your own thread pool saves you the effort of managing that pool, because pmap already uses Clojure's thread pool, which is initialized properly for the current number of processors. It also helps you take advantage of sequence chunking (which could perhaps help).
You could also use channels. Maybe something like this:

(require '[clojure.core.async :refer [chan go <! >! >!!]])

(def jobs (chan))
(def solutions (chan))
(def accepted-solutions (atom (vector)))

(go (loop [job (<! jobs)]
      (when job
        (go (doseq [solution (process-job-into-solutions job)]
              (>! solutions solution)))
        (recur (<! jobs)))))

(go (loop [solution (<! solutions)]
      (when (acceptable? solution)
        (swap! accepted-solutions conj solution)
        (doseq [new-job (generate-new-jobs solution)]
          (>! jobs new-job))
        (recur (<! solutions)))))

(>!! jobs initial-job)

Clojure: reduce, reductions and infinite lists

Reduce and reductions let you accumulate state over a sequence.
Each element in the sequence will modify the accumulated state until
the end of the sequence is reached.
What are the implications of calling reduce or reductions on an infinite list?
(def c (cycle [0]))
(reduce + c)
This will quickly throw an OutOfMemoryError. By the way, (reduce + (cycle [0])) does not throw an OutOfMemoryError (at least not for the time I waited). It never returns. Not sure why.
Is there any way to call reduce or reductions on an infinite list in a way that makes sense? The problem I see in the above example, is that eventually the evaluated part of the list becomes large enough to overflow the heap. Maybe an infinite list is not the right paradigm. Reducing over a generator, IO stream, or an event stream would make more sense. The value should not be kept after it's evaluated and used to modify the state.
It will never return, because reduce takes a sequence and a function and applies the function until the input sequence is empty; only then can it know it has the final value.
Reduce on a truly infinite seq would not make a lot of sense unless it is producing a side effect like logging its progress.
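It is worth noting that since Clojure 1.5 a reducing function can short-circuit by returning a reduced value, which does let reduce terminate on an infinite seq (a small sketch):

(reduce (fn [acc x]
          (if (> acc 100)
            (reduced acc)  ; signal early termination
            (+ acc x)))
        (range))
;; => 105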
In your first example you are first creating a var referencing an infinite sequence.
(def c (cycle [0]))
Then you are passing the contents of the var c to reduce which starts reading elements to update its state.
(reduce + c)
These elements can't be garbage collected because the var c holds a reference to the first of them, which in turn holds a reference to the second, and so on. Eventually it reads as many elements as there is space for in the heap, and then OOMs.
Your second example doesn't blow the heap because you are not keeping a reference to the data you have already consumed, so the items on the seq returned by cycle are GCed as fast as they are produced. With (cycle [0]) the accumulated sum also simply stays 0, so the call never returns; with nonzero elements the accumulator would eventually overflow a long and throw (Clojure 1.3) or get promoted to a BigInteger that keeps growing (Clojure 1.2).
(reduce + (cycle [0]))
Arthur's answer is good as far as it goes, but it looks like he doesn't address your second question about reductions. reductions returns a lazy sequence of intermediate stages of what reduce would have returned if given a list only N elements long. So it's perfectly sensible to call reductions on an infinite list:
user=> (take 10 (reductions + (range)))
(0 1 3 6 10 15 21 28 36 45)
If you want to keep getting items from a list like an IO stream and keep state between runs, you cannot use doseq (without resorting to defs). Instead, a good approach is to use loop/recur: this avoids consuming stack space and lets you carry state along. In your case:

(loop [c (cycle [0])]
  (if (evaluate-some-condition (first c))
    (do (do-something-with (first c))
        (recur (rest c)))
    nil))

(Note that recur must be in tail position, hence the do.) Of course, compared to your case, there is a condition check here to make sure we don't loop indefinitely.
As others have pointed out, it doesn't make sense to run reduce directly on an infinite sequence, since reduce is non-lazy and needs to consume the full sequence.
As an alternative for this kind of situation, here's a helpful function that reduces only the first n items in a sequence, implemented using recur for reasonable efficiency:
(defn counted-reduce
  ([n f s]
   (counted-reduce (dec n) f (first s) (rest s)))
  ([n f initial s]
   (if (<= n 0)
     initial
     (recur (dec n) f (f initial (first s)) (rest s)))))
(counted-reduce 10000000 + (range))
=> 49999995000000