Why is there no significant speedup using reducers in this example? - clojure

(require '[clojure.core.reducers :as r])
(def data (into [] (take 10000000 (repeatedly #(rand-int 1000)))))
(defn frequencies [coll]
(reduce (fn [counts x]
(merge-with + counts {x 1}))
{} coll))
(defn pfrequencies [coll]
(r/reduce (fn [counts x]
(merge-with + counts {x 1}))
{} coll))
user=> (time (do (frequencies data) nil))
"Elapsed time: 29697.183 msecs"
user=> (time (do (pfrequencies data) nil))
"Elapsed time: 25273.794 msecs"
user=> (time (do (frequencies data) nil))
"Elapsed time: 25384.086 msecs"
user=> (time (do (pfrequencies data) nil))
"Elapsed time: 25778.502 msecs"
And who can show me an example with significant speedup?
I'm running on Mac OSX 10.7.5 with Java 1.7 on an Intel Core i7 (2 cores, http://ark.intel.com/products/54617).

You called it pfrequencies, which, along with your parallel-processing tag on the question, suggests you think that something is using multiple threads here. That is not the case, and neither is it the "main" goal of the reducers library.
The main thing reducers buy you is that you don't need to allocate many intermediate cons cells for your lazy sequences. Before reducers were introduced, frequencies would allocate 10000000 cons cells to create a sequential view of the vector for reduce to use. Now that reducers exist, vectors know how to reduce themselves without creating such temporary objects. But that feature has been backported into clojure.core/reduce, which behaves exactly like r/reduce (ignoring some minor features that are irrelevant here). So you are just benchmarking your function against an identical clone of itself.
The reducers library also includes the notion of a fold, which can do some work in parallel, and then later merge together the intermediate results. To use this, you need to provide more information than reduce needs: you must define how to start a "chunk" from nothing; your function must be associative; and you must specify how to combine chunks. A. Webb's answer demonstrates how to use fold correctly, to get work done on multiple threads.
However, you're unlikely to get any benefit from folding: in addition to the reason he notes (you give up transients, compared to clojure.core/frequencies), building a map is not easily parallelizable. If the bulk of the work in frequencies were addition (as it would be in something like (frequencies (repeat 1e6 1))), then fold would help; but most of the work is in managing the keys in the hashmap, which really has to be single-threaded eventually. You can build maps in parallel, but then you have to merge them together; since that combination step takes time proportional to the size of the chunk, rather than constant time, you gain little by doing the chunks on a separate thread anyway.

A fold version of your frequencies function would look something like
(defn pfrequencies [coll]
(r/fold
(fn combinef
([] {})
([x y] (merge-with + x y)))
(fn reducef
([counts x] (merge-with + counts {x 1})))
coll))
On 2 cores, it will likely be much slower than clojure.core/frequencies which uses transients. At least on 4 cores, it is faster (2x) than the first implementation, but still slower than clojure.core/frequencies.
You might also experiment with
(defn p2frequencies [coll]
(apply merge-with + (pmap clojure.core/frequencies (partition-all 512 coll))))

Some serious food for thought in the answers here. In this specific case maps should not be needed, since the result domain can be easily predicted and put in a vector where the index can be used. So, a naive implementation of a naive problem would be something like:
(defn freqs
[coll]
(reduce (fn [counts x] (assoc counts x (inc (get counts x))))
(vec (int-array 1000 0))
coll))
(defn rfreqs
[coll]
(r/fold
(fn combinef
([] (vec (int-array 1000 0)))
([& cols] (apply mapv + cols)))
(fn reducef
[counts x] (assoc counts x (inc (get counts x))))
coll))
Here the combinef would be a simple map addition over the 1000 columns of the resulting collections, which should be negligible.
This gives the reducer version a speedup of about 2-3x over the normal version, especially on bigger (10x-100x) datasets. Some twiddling with the partition size of r/fold (optional 'n' parameter) can be done as finetuning. Seemed optimal to use (* 16 1024) with a data size of 1E8 (need 6GB JVM at least).
You could even use transients in both versions, but I didn't notice much improvements.
I know this version isn't appropriate for generic usage, but it might show the speed improvement without the hash management overhead.

Related

How to implement a parallel logical or with early termination in Clojure

I would like to define a predicate that, taking as input some predicates
with corresponding inputs (they could be given as a lazy sequence of calls),
runs them in parallel and computes the logical or of the results,
in such a way that, the moment a predicate call terminates returning true,
the whole computation also terminates (returning true).
Apart from offering time optimization, this would also help avoiding
non-termination in some cases (some predicate calls may not terminate).
Actually, interpreting non-termination as a third undefined value,
this predicate simulates the or operation in Kleene's K3 logic
(the join in the initial centered Kleene algebra).
Something similar is presented here for the Haskell family.
Is there any (preferably simple) way to do this in Clojure?
EDIT: I decided to add some clarifications after reading the comments.
(a) First of all, what happens after the thread pool gets exhausted is of less importance. I think creating a thread pool large enough for our needs is a reasonable convention.
(b) The most crucial requirement is that the predicate calls start running in parallel and, once a predicate call terminates returning true, all the other threads running get interrupted. The intended behavior is that:
if there is a predicate call returning true: the parallel or returns true
else if there is a predicate call that does not terminate: the parallel or does not terminate
else: the parallel or returns false
In other words, it behaves like the join in the 3-element lattice given by false<undefined<true, with undefined representing non-termination.
(c) The parallel or should be able to take as input many predicates and many predicate-inputs (each one corresponding to a predicate). But it would be even better if it took as input a lazy sequence. Then, naming the parallel or pany (for "parallel any"), we could have calls like the following:
(pany (map (comp eval list) predicates inputs))
(pany (map (comp eval list) predicates (repeat input)))
(pany (map (comp eval list) (repeat predicate) inputs)) which is equivalent to (pany (map predicate (unchunk inputs)))
As a final remark, I think that it is quite natural to ask for things like pany, a dual pall or a mechanism for building such early-terminating parallel reductions to be easily implementable or even built-in in a parallelism-oriented language like Clojure.
I will define our predicates in terms of a reducing function. Practically, we could reimplement all of the Clojure iteration functions to support this parallel operation, but I'll just use reduce as an example.
I'll define a computation function. I'll just use the same one, but nothing stopping you from having many. The function is "true" if it accumulates 1000.
(defn computor [acc val]
(let [new (+' acc val)] (if (> new 1000) (reduced new) new)))
(reduce computor 0 (range))
;; =>
1035
(reduce computor 0 (range Long/MIN_VALUE 0))
;; =>
;; ...this is a proxy for a non-returning computation
;; wrap these up in a form suitable for application of reduction
(def predicates [[computor 0 (range)]
[computor 0 (range Long/MIN_VALUE 0)]])
Now let's get to the meat of this. I want to take a step in each computation, and if one of the computations completes, I want to return it. In actual fact one step at a time using pmap is very slow - the units of work are too small to be worth threading. Here I've changed things to do 1000 iterations of each unit of work before moving on. You'd probably tune this based on your workload and the cost of a step.
(defn p-or-reducer* [reductions]
(let [splits (map #(split-at 1000 %) reductions) ;; do at least 1000 iterations per cycle
complete (some #(if (empty? (second %)) (last (first %))) splits)]
(or complete (recur (map second splits)))))
I then wrap this in a driver.
(defn p-or [s]
(p-or-reducer* (map #(apply reductions %) s)))
(p-or predicates)
;; =>
1035
Where to insert the CPU parallelism? s/map/pmap/ in p-or-reducer* should do it. I suggest just parallelising the first operation, as this will drive the reducing sequences to compute.
(defn p-or-reducer* [reductions]
(let [splits (pmap #(split-at 1000 %) reductions) ;; do at least 1000 iterations per cycle
complete (some #(if (empty? (second %)) (last (first %))) splits)]
(or complete (recur (map second splits)))))
(def parallelism-tester (conj (vec (repeat 40000 [computor 0 (range Long/MIN_VALUE 0)]))
[computor 0 (range)]))
(p-or parallelism-tester) ;; terminates even though the first 40K predicates will not
It's extremely hard to define a performant generic version of this. Without knowing the cost per iteration an efficient parallelism strategy is hard to derive - if one iteration take 10s then we'd probably take a single step at a time. If it takes 100ns then we need to take many steps at a time.
Will you consider adopting core.async to handle parallel tasks with async/go or async/thread, and early return with async/alts!?
For example, to turn the core or function from serial into parallel. We can create a macro (I called it por) to wrap the input functions (or predicates) into async/thread and then do a socket select async/alts! on top of them:
(defmacro por [& fns]
`(let [[v# c#] (async/alts!!
[~#(for [f fns]
(list `async/thread f))])]
v#))
(time
(por (do (println "running a") (Thread/sleep 30) :a)
(do (println "running b") (Thread/sleep 20) :b)
(do (println "running c") (Thread/sleep 10) :c)))
;; running a
;; running b
;; running c
;; "Elapsed time: 11.919169 msecs"
;; => :c
In comparison with the original or (which run in serial):
(time
(or (do (println "running a") (Thread/sleep 30) :a)
(do (println "running b") (Thread/sleep 20) :b)
(do (println "running c") (Thread/sleep 10) :c)))
;; running a
;; => :a
;; "Elapsed time: 31.642506 msecs"

Why does `doall` not force the sequence to be counted?

(counted? (map identity (range 100))) ;; false, expected
(time (counted? (doall (map identity (range 100))))) ;; false, unexpected
(time (counted? (into '() (map identity (range 100))))) ;; true, expected - but slower
(Clojure "1.8.0")
The first result is expected since map is lazy.
The second is unexpected for me, since after doall the entire sequence has been realized, is now in memory. Since the implementation probably has to walk through the list anyway, why not count it?
The third is a workaround. Is it idiomatic? Is there an alternative?
It sounds like you already know that lazy sequences are not counted?.
However, in your example, whilst doall realizes the entire sequence, it still returns that result as a LazySeq. Have a look at this REPL output:
user=> (class (doall (map identity (range 100))))
clojure.lang.LazySeq
Using something like into seems to be the right way to go to me; because you need to force your result into a non-lazy sequence. You say into is slower, but it still seems acceptably fast to me.
Nevertheless, you could perhaps improve the time performance by calling vec on your result, instead of into:
user=> (time (counted? (into '() (map identity (range 100)))))
"Elapsed time: 0.287542 msecs"
true
user=> (time (counted? (vec (map identity (range 100)))))
"Elapsed time: 0.169342 msecs"
true
Note: I'm using Clojure 1.9, rather than 1.8 on my machine, so you may see different results.
Update / corrections:
Commenters have respectfully pointed out that:
1) time is terrible for benchmarking, and doesn't really provide any useful evidence in this instance.
2) (vec x) substituted for (list x); (list x) is a constant-time operation no matter what the contents of x.
3) doall returns its input as its output; you get a LazySeq if you passed in a LazySeq, or a map if you passed in a map, etc.

How can I compute the sum of a large list of numbers in parallel using Clojure

I am trying to figure out how to use clojure to efficiently apply a simple operation to a large sequence in parallel. I would like to be able to use the parallel solution to take advantage of the multiple cores on my machine to achieve some speedup.
I am attempting to use pmap in combination with partition-all to reduce the overhead of creating a future for every item in the input seq. Unfortunately, partition-all forces the complete evaluation of each partition seq. This causes an OutOfMemoryError on my machine.
(defn sum [vs]
(reduce + vs))
(def workers
(+ 2 (.. Runtime getRuntime availableProcessors)))
(let
[n 80000000
vs (range n)]
(time (sum vs))
(time (sum (pmap sum (partition-all (long (/ n workers)) vs)))))
How can I apply sum to a large input set, and beat the performance of the serial implementation?
Solution
Thanks to #Arthur Ulfeldt for pointing out the reducers library. Here is the solution using reducers. This code shows the expected performance improvement when running on a multi-core machine. (NOTE: I have changed vs to be a function to make the timing be more accurate)
(require '[clojure.core.reducers :as r])
(let
[n 80000000
vs #(range n)]
(time (reduce + (vs)))
(time (r/fold + (vs)))
When using pmap I have found that fairly large chunks are required to overcome the switching and future overhead try a chunk size of 10,000 for an opperation as fast as +. The potential gains are bounded by the overhead of generating the chunks. This results in an optimal value that balances the available cores and the time required to make the chunks. In this case with + as the workload I was unable to make this faster than the single threaded option.
If you're interested in doing this without pmap and potentially using fork/join check out the new(ish) reducers library
The OOM situation comes from the first test realizing the lazy sequence from (range n) which is then retained so it can be passed to the second sequence.
If I make the + function much slower by defining a slow+ function and use that the diference between single thread, pmap over chunks, and reducers w/ forkJoin become visable:
user> *clojure-version*
{:major 1, :minor 5, :incremental 0, :qualifier "RC15"}
(require '[clojure.core.reducers :as r])
(def workers
(+ 2 (.. Runtime getRuntime availableProcessors)))
(defn slow+
([] 0)
([x] x)
([x y] (reduce + (range 100000)) (+ x y)))
(defn run-test []
(let [n 8000]
(time (reduce slow+ (range n)))
(time (reduce slow+ (pmap #(reduce slow+ %) (partition-all (* workers 100) (range n)))))
(time (r/fold slow+ (vec (range n))))))
user> (run-test)
"Elapsed time: 28655.951241 msecs" ; one thread
"Elapsed time: 6975.488591 msecs" ; pmap over chunks
"Elapsed time: 8170.254426 msecs" ; using reducer

Performance of large maps in Clojure

I have a Clojure program which is using some large maps (1000 - 2000 items) which are accessed 30 - 40 times a second and using Strings as the keys. I was wondering if there is a big performance difference if I used keywords or symbols as the keys instead?
Clojure map lookups are very fast, and do not particularly depend on the size of the map.
In fact, they are almost as fast as pure Java HashMaps, while enjoying many advantages over traditional HashMaps including being immutable and thread-safe.
If you are only doing 30-40 lookups a second then I guarantee you will never notice the difference regardless of what you use as keys. Worrying about this would count as premature optimisation.
Let's prove it: the following code does a million map lookups using strings as keys:
(def str-keys (map str (range 1000)))
(def m (zipmap str-keys (range 1000)))
(time (dotimes [i 1000] (doseq [k str-keys] (m k))))
=> "Elapsed time: 69.082224 msecs"
The following does a million map lookups using keywords as keys:
(def kw-keys (map #(keyword (str %)) (range 1000)))
(def m (zipmap kw-keys (range 1000)))
(time (dotimes [i 1000] (doseq [k kw-keys] (m k))))
=> "Elapsed time: 59.212864 msecs"
And for symbols:
(def sym-keys (map #(symbol (str %)) (range 1000)))
(def m (zipmap sym-keys (range 1000)))
(time (dotimes [i 1000] (doseq [k sym-keys] (m k))))
=> "Elapsed time: 61.590925 msecs"
In my tests, Symbols and Keywords were slightly faster than Strings, but still the difference could easily be explained by statistical error, and the average execution time per lookup was less than 100 nanoseconds for all cases.
So your 30-40 lookups are probably taking in the order of 0.001% of your CPU time (this even allows for the fact that in a real app, lookups will probably be a few times slower due to caching issues)
The likely reason for Keywords in particular being slightly faster is that they are interned (and can therefore use reference equality to check for equality). But as you can see the difference is sufficiently small that you really don't need to worry about it.

Is there a way to control the number of threads used with pmap?

I am just doing some performance testing with clojure using pmap and I would like to be able to control the number of threads being used with pmap. I know when using something like OpenMP one can set the number of threads using omp_set_num_threads(). I was wondering if there would be anything similar in clojure.
Here's code for pmap:
(defn pmap
"Like map, except f is applied in parallel. Semi-lazy in that the
parallel computation stays ahead of the consumption, but doesn't
realize the entire result unless required. Only useful for
computationally intensive functions where the time of f dominates
the coordination overhead."
([f coll]
(let [n (+ 2 (.. Runtime getRuntime availableProcessors))
rets (map #(future (f %)) coll)
step (fn step [[x & xs :as vs] fs]
(lazy-seq
(if-let [s (seq fs)]
(cons (deref x) (step xs (rest s)))
(map deref vs))))]
(step rets (drop n rets))))
As you can see, pmap takes all available processors and uses them cyclically. So, no, there's no way to set the number of threads... but you can always write your own pmap, which will provide such functionality.