Using pmap to increment a counter misses values - clojure

(def foo (atom 0)) ; foo must be an atom for swap! to work
(swap! foo (fn [old] 3)) ; set the counter to 3
(pmap
 (fn [_] (swap! foo inc))
 (range 10000))
@foo ; It prints 1027, which is smaller than the expected 10003.
So why is foo not 10003?
Does this have anything to do with laziness?

From the documentation of pmap:
Semi-lazy in that the parallel computation stays ahead of the consumption, but doesn't
realize the entire result unless required. Only useful for
computationally intensive functions where the time of f dominates
the coordination overhead.
So yes, it's only realizing a few elements. Try wrapping the pmap in a doall and you'll see your intended result.
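For example:

(swap! foo (fn [old] 3)) ; reset the counter
(doall
 (pmap
  (fn [_] (swap! foo inc))
  (range 10000)))
@foo ;; => 10003

Since swap! on an atom is atomic, all 10000 increments are applied once doall forces the whole sequence.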

Related

Clojure: when to use memoize and when to use delay/force?

I've just started learning Clojure and am trying to understand the difference between two approaches which at first sight seem nearly identical.
(def func0 (delay (do
                    (println "did some work")
                    100)))
so.core=> (force func0)
did some work
100
so.core=> (force func0)
100
(defn vanilla-func [] (println "did some work") 100)
(def func1 (memoize vanilla-func))
so.core=> (func1)
did some work
100
so.core=> (func1)
100
Both approaches do some sort of function memoization. What am I missing?
I've tried to find the explanation on https://clojuredocs.org/clojure.core/delay and https://clojuredocs.org/clojure.core/memoize but couldn't find one.
delay holds one result and you have to deref (force) it to get the result.
memoize is an unbounded cache that caches the result depending on the
input arguments. E.g.
user=> (def myinc (memoize (fn [x] (println x) (inc x))))
#'user/myinc
user=> (myinc 1)
1        ; the println fires on the first call with this argument
2
user=> (myinc 1)
2        ; cached result, no println this time
In your (argument-less) example the only difference is that you can use
the result directly (no deref needed).
Classic use-cases for delay are things needed later that would block or
slow down startup, or if you want to "hide" top-level defs from
the compiler (e.g. because they have side-effects).
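For example (a sketch; connect-to-db stands in for any expensive, side-effecting initializer and is not a real function):

(def conn (delay (connect-to-db "jdbc:..."))) ; nothing runs at load time
;; the first @conn (or (force conn)) creates the connection;
;; every later deref reuses the cached result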
memoize is a classic cache and is best used if the calculation is
expensive and the set of input arguments is not excessive. There are
other caching options in the clojure-verse that allow better
configuration (e.g. caches that are bounded).

How to implement a parallel logical or with early termination in Clojure

I would like to define a predicate that, taking as input some predicates
with corresponding inputs (they could be given as a lazy sequence of calls),
runs them in parallel and computes the logical or of the results,
in such a way that, the moment a predicate call terminates returning true,
the whole computation also terminates (returning true).
Apart from offering time optimization, this would also help avoiding
non-termination in some cases (some predicate calls may not terminate).
Actually, interpreting non-termination as a third undefined value,
this predicate simulates the or operation in Kleene's K3 logic
(the join in the initial centered Kleene algebra).
Something similar is presented here for the Haskell family.
Is there any (preferably simple) way to do this in Clojure?
EDIT: I decided to add some clarifications after reading the comments.
(a) First of all, what happens after the thread pool gets exhausted is of less importance. I think creating a thread pool large enough for our needs is a reasonable convention.
(b) The most crucial requirement is that the predicate calls start running in parallel and, once a predicate call terminates returning true, all the other threads running get interrupted. The intended behavior is that:
if there is a predicate call returning true: the parallel or returns true
else if there is a predicate call that does not terminate: the parallel or does not terminate
else: the parallel or returns false
In other words, it behaves like the join in the 3-element lattice given by false<undefined<true, with undefined representing non-termination.
(c) The parallel or should be able to take as input many predicates and many predicate-inputs (each one corresponding to a predicate). But it would be even better if it took as input a lazy sequence. Then, naming the parallel or pany (for "parallel any"), we could have calls like the following:
(pany (map (comp eval list) predicates inputs))
(pany (map (comp eval list) predicates (repeat input)))
(pany (map (comp eval list) (repeat predicate) inputs)) which is equivalent to (pany (map predicate (unchunk inputs)))
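Note that unchunk is not in clojure.core; a commonly used definition, included here for reference, is:

(defn unchunk [s]
  (lazy-seq
   (when (seq s)
     (cons (first s) (unchunk (rest s))))))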
As a final remark, I think that it is quite natural to ask for things like pany, a dual pall or a mechanism for building such early-terminating parallel reductions to be easily implementable or even built-in in a parallelism-oriented language like Clojure.
I will define our predicates in terms of a reducing function. Practically, we could reimplement all of the Clojure iteration functions to support this parallel operation, but I'll just use reduce as an example.
I'll define a computation function, and I'll just use the same one for every predicate, but nothing stops you from having many. The function is "true" if it accumulates 1000.
(defn computor [acc val]
  (let [new (+' acc val)]
    (if (> new 1000) (reduced new) new)))

(reduce computor 0 (range))
;; => 1035

(reduce computor 0 (range Long/MIN_VALUE 0))
;; => never returns; this is a proxy for a non-returning computation

;; wrap these up in a form suitable for application of reductions
(def predicates [[computor 0 (range)]
                 [computor 0 (range Long/MIN_VALUE 0)]])
Now let's get to the meat of this. I want to take a step in each computation, and if one of the computations completes, I want to return it. In actual fact one step at a time using pmap is very slow - the units of work are too small to be worth threading. Here I've changed things to do 1000 iterations of each unit of work before moving on. You'd probably tune this based on your workload and the cost of a step.
(defn p-or-reducer* [reductions]
  (let [splits (map #(split-at 1000 %) reductions) ;; do at least 1000 iterations per cycle
        complete (some #(if (empty? (second %)) (last (first %))) splits)]
    (or complete (recur (map second splits)))))
I then wrap this in a driver.
(defn p-or [s]
  (p-or-reducer* (map #(apply reductions %) s)))

(p-or predicates)
;; => 1035
Where to insert the CPU parallelism? s/map/pmap/ in p-or-reducer* should do it. I suggest just parallelising the first operation, as this will drive the reducing sequences to compute.
(defn p-or-reducer* [reductions]
  (let [splits (pmap #(split-at 1000 %) reductions) ;; do at least 1000 iterations per cycle
        complete (some #(if (empty? (second %)) (last (first %))) splits)]
    (or complete (recur (map second splits)))))
(def parallelism-tester (conj (vec (repeat 40000 [computor 0 (range Long/MIN_VALUE 0)]))
                              [computor 0 (range)]))

(p-or parallelism-tester) ;; terminates even though the first 40K predicates will not
It's extremely hard to define a performant generic version of this. Without knowing the cost per iteration an efficient parallelism strategy is hard to derive - if one iteration take 10s then we'd probably take a single step at a time. If it takes 100ns then we need to take many steps at a time.
Would you consider adopting core.async to handle parallel tasks with async/go or async/thread, and early return with async/alts!?
For example, to turn the core or function from serial into parallel, we can create a macro (I called it por) that wraps the input forms (or predicates) in async/thread and then does a socket-select with async/alts!! on top of them:
(require '[clojure.core.async :as async])

(defmacro por [& fns]
  `(let [[v# c#] (async/alts!!
                  [~@(for [f fns]
                       (list `async/thread f))])]
     v#))
(time
 (por (do (println "running a") (Thread/sleep 30) :a)
      (do (println "running b") (Thread/sleep 20) :b)
      (do (println "running c") (Thread/sleep 10) :c)))
;; running a
;; running b
;; running c
;; "Elapsed time: 11.919169 msecs"
;; => :c
In comparison with the original or (which runs serially):

(time
 (or (do (println "running a") (Thread/sleep 30) :a)
     (do (println "running b") (Thread/sleep 20) :b)
     (do (println "running c") (Thread/sleep 10) :c)))
;; running a
;; => :a
;; "Elapsed time: 31.642506 msecs"

Why is there no significant speedup using reducers in this example?

(require '[clojure.core.reducers :as r])

(def data (into [] (take 10000000 (repeatedly #(rand-int 1000)))))

(defn frequencies [coll]
  (reduce (fn [counts x]
            (merge-with + counts {x 1}))
          {} coll))

(defn pfrequencies [coll]
  (r/reduce (fn [counts x]
              (merge-with + counts {x 1}))
            {} coll))
user=> (time (do (frequencies data) nil))
"Elapsed time: 29697.183 msecs"
user=> (time (do (pfrequencies data) nil))
"Elapsed time: 25273.794 msecs"
user=> (time (do (frequencies data) nil))
"Elapsed time: 25384.086 msecs"
user=> (time (do (pfrequencies data) nil))
"Elapsed time: 25778.502 msecs"
Can anyone show me an example with a significant speedup?
I'm running on Mac OSX 10.7.5 with Java 1.7 on an Intel Core i7 (2 cores, http://ark.intel.com/products/54617).
You called it pfrequencies, which, along with your parallel-processing tag on the question, suggests you think that something is using multiple threads here. That is not the case, and neither is it the "main" goal of the reducers library.
The main thing reducers buy you is that you don't need to allocate many intermediate cons cells for your lazy sequences. Before reducers were introduced, frequencies would allocate 10000000 cons cells to create a sequential view of the vector for reduce to use. Now that reducers exist, vectors know how to reduce themselves without creating such temporary objects. But that feature has been backported into clojure.core/reduce, which behaves exactly like r/reduce (ignoring some minor features that are irrelevant here). So you are just benchmarking your function against an identical clone of itself.
The reducers library also includes the notion of a fold, which can do some work in parallel, and then later merge together the intermediate results. To use this, you need to provide more information than reduce needs: you must define how to start a "chunk" from nothing; your function must be associative; and you must specify how to combine chunks. A. Webb's answer demonstrates how to use fold correctly, to get work done on multiple threads.
However, you're unlikely to get any benefit from folding: in addition to the reason he notes (you give up transients, compared to clojure.core/frequencies), building a map is not easily parallelizable. If the bulk of the work in frequencies were addition (as it would be in something like (frequencies (repeat 1e6 1))), then fold would help; but most of the work is in managing the keys in the hashmap, which really has to be single-threaded eventually. You can build maps in parallel, but then you have to merge them together; since that combination step takes time proportional to the size of the chunk, rather than constant time, you gain little by doing the chunks on a separate thread anyway.
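The combine step is just a merge-with +, which has to touch every key of the maps being merged, e.g.:

(merge-with + {0 3, 1 5} {1 1, 2 2})
;; => {0 3, 1 6, 2 2}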
A fold version of your frequencies function would look something like
(defn pfrequencies [coll]
  (r/fold
   (fn combinef
     ([] {})
     ([x y] (merge-with + x y)))
   (fn reducef
     ([counts x] (merge-with + counts {x 1})))
   coll))
On 2 cores, it will likely be much slower than clojure.core/frequencies which uses transients. At least on 4 cores, it is faster (2x) than the first implementation, but still slower than clojure.core/frequencies.
You might also experiment with
(defn p2frequencies [coll]
  (apply merge-with + (pmap clojure.core/frequencies (partition-all 512 coll))))
Some serious food for thought in the answers here. In this specific case maps should not be needed, since the result domain can be easily predicted and put in a vector where the index can be used. So, a naive implementation of a naive problem would be something like:
(defn freqs
  [coll]
  (reduce (fn [counts x] (assoc counts x (inc (get counts x))))
          (vec (int-array 1000 0))
          coll))

(defn rfreqs
  [coll]
  (r/fold
   (fn combinef
     ([] (vec (int-array 1000 0)))
     ([& cols] (apply mapv + cols)))
   (fn reducef
     [counts x] (assoc counts x (inc (get counts x))))
   coll))
Here the combinef is a simple element-wise addition over the 1000 slots of the resulting vectors, which should be negligible.
This gives the reducer version a speedup of about 2-3x over the normal version, especially on bigger (10x-100x) datasets. Some twiddling with the partition size of r/fold (the optional 'n' parameter) can be done as fine-tuning; (* 16 1024) seemed optimal with a data size of 1e8 (which needs at least a 6GB JVM).
You could even use transients in both versions, but I didn't notice much improvement.
I know this version isn't appropriate for generic usage, but it might show the speed improvement without the hash management overhead.
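For reference, a transient variant of freqs could look like this (a sketch; freqs-t is my naming, and it assumes the same 0-999 value domain):

(defn freqs-t
  [coll]
  (persistent!
   (reduce (fn [counts x] (assoc! counts x (inc (get counts x))))
           (transient (vec (int-array 1000 0)))
           coll)))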

How many threads does Clojure's pmap function spawn for URL-fetching operations?

The documentation on the pmap function leaves me wondering how efficient it would be for something like fetching a collection of XML feeds over the web. I have no idea how many concurrent fetch operations pmap would spawn and what the maximum would be.
If you check the source you see:
> (use 'clojure.repl)
> (source pmap)
(defn pmap
  "Like map, except f is applied in parallel. Semi-lazy in that the
  parallel computation stays ahead of the consumption, but doesn't
  realize the entire result unless required. Only useful for
  computationally intensive functions where the time of f dominates
  the coordination overhead."
  {:added "1.0"}
  ([f coll]
   (let [n (+ 2 (.. Runtime getRuntime availableProcessors))
         rets (map #(future (f %)) coll)
         step (fn step [[x & xs :as vs] fs]
                (lazy-seq
                 (if-let [s (seq fs)]
                   (cons (deref x) (step xs (rest s)))
                   (map deref vs))))]
     (step rets (drop n rets))))
  ([f coll & colls]
   (let [step (fn step [cs]
                (lazy-seq
                 (let [ss (map seq cs)]
                   (when (every? identity ss)
                     (cons (map first ss) (step (map rest ss)))))))]
     (pmap #(apply f %) (step (cons coll colls))))))
The (+ 2 (.. Runtime getRuntime availableProcessors)) is a big clue there. pmap will grab the first (2 + processors) pieces of work and run them asynchronously via future. So if you have 2 cores, it's going to launch 4 pieces of work at a time, trying to keep a bit ahead of your consumption, but the maximum in flight should be processors + 2.
future ultimately uses the agent I/O thread pool which supports an unbounded number of threads. It will grow as work is thrown at it and shrink if threads are unused.
Building on Alex's excellent answer that explains how pmap works, here's my suggestion for your situation:
(doall
 (map
  #(future (my-web-fetch-function %))
  list-of-xml-feeds-to-fetch))
Rationale:
- You want as many pieces of work in-flight as you can, since most will block on network IO.
- future will fire off an asynchronous piece of work for each request, to be handled in a thread pool. You can let Clojure take care of that intelligently.
- The doall on the map forces the evaluation of the full sequence (i.e. the launch of all the requests).
- Your main thread can start dereferencing the futures right away, and can therefore continue making progress as the individual results come back (see the example below).
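For example, to fire off all the fetches and collect the results later:

(def futs
  (doall (map #(future (my-web-fetch-function %))
              list-of-xml-feeds-to-fetch)))

(def results (map deref futs)) ; each deref blocks only until that particular fetch is done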
No time to write a long response, but there's a clojure.contrib http-agent which creates each get/post request as its own agent. So you can fire off a thousand requests and they'll all run in parallel and complete as the results come in.
Looking at the operation of pmap, it seems to run 32 threads at a time no matter how many processors you have; the issue is that map stays ahead of the computation by 32 and the futures are started on their own. (See the sample below.)
(defn samplef [n]
  (println "starting " n)
  (Thread/sleep 10000)
  n)

(def result (pmap samplef (range 0 100)))
; you will wait for 10 seconds and see 32 prints; then when you take the 33rd, another 32
; prints appear; this means that you are doing 32 concurrent threads at a time
; to me this is not perfect
; SALUDOS Felipe

Shouldn't a tail-recursive function also be faster?

I have the following Clojure code to calculate a number with a certain "factorable" property. (What exactly the code does is secondary.)
(defn factor-9
  ([]
   (let [digits (take 9 (iterate #(inc %) 1))
         nums (map (fn [x] (Integer. (apply str x))) (permutations digits))]
     (some (fn [x] (and (factor-9 x) x)) nums)))
  ([n]
   (or
    (= 1 (count (str n)))
    (and (divisible-by-length n) (factor-9 (quot n 10))))))
Now, I'm into TCO and realize that Clojure can only provide tail-call optimization if explicitly told to via the recur keyword. So I've rewritten the code to do that (replacing factor-9 with recur being the only difference):
(defn factor-9
  ([]
   (let [digits (take 9 (iterate #(inc %) 1))
         nums (map (fn [x] (Integer. (apply str x))) (permutations digits))]
     (some (fn [x] (and (factor-9 x) x)) nums)))
  ([n]
   (or
    (= 1 (count (str n)))
    (and (divisible-by-length n) (recur (quot n 10))))))
To my knowledge, TCO has a double benefit. The first one is that it does not use the stack as heavily as a non tail-recursive call and thus does not blow it on larger recursions. The second, I think is that consequently it's faster since it can be converted to a loop.
Now, I've made a very rough benchmark and have not seen any difference between the two implementations. Am I wrong in my second assumption, or does this have something to do with running on the JVM (which does not have automatic TCO) and recur using a trick to achieve it?
Thank you.
The use of recur does speed things up, but only by about 3 nanoseconds (really) over a recursive call. When things get that small they can be hidden in the noise of the rest of the test. I wrote four tests (link below) that are able to illustrate the difference in performance.
I'd also suggest using something like criterium when benchmarking. (Stack Overflow won't let me post with more than 1 link since I've got no reputation to speak of, so you'll have to google it, maybe "clojure criterium")
For formatting reasons, I've put the tests and results in this gist.
Briefly, to compare relatively: if the recursive test is 1, then the looping test is about 0.45 and the TCO tests about 0.87; the absolute difference between the recursive and TCO tests is around 3 ns.
Of course, all the caveats about benchmarking apply.
When optimizing any code, it's good to start from potential or actual bottlenecks and optimize that first.
It seems to me that this particular piece of code is eating most of your CPU time:
(map (fn [x] (Integer. (apply str x))) (permutations digits))
And that doesn't depend on TCO in any way - it is executed in the same way either way. So a tail call in this particular example will keep you from using up the stack, but to achieve better performance, try optimizing this part instead.
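For instance, one way to avoid the string round-trip entirely (a sketch; digits->int is a made-up helper, not part of the original code):

(defn digits->int [ds]
  (reduce (fn [acc d] (+ (* 10 acc) d)) 0 ds))

(digits->int [1 2 3]) ;; => 123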
Just a gentle reminder that Clojure has no TCO.
After evaluating (factor-9 (quot n 10)), an and and an or still have to be evaluated before the function can return. Thus it is not tail-recursive.