I am a beginner in Clojure. While reading about reducers, I came across the term "foldable collection".
The docs mention that vectors and maps are foldable collections, but lists are not.
I am trying to understand what a foldable collection is, and why vectors and maps are foldable.
I have not found any definition or explanation of a foldable collection.
The answer is there in the docs, if not quite as clear as it could be:
Additionally, some collections (persistent vectors and maps) are
foldable. The fold operation on a reducer executes the reduction in
parallel...
The idea is that, with modern hardware, a "reduction" operation like summing all elements of a vector can be done in parallel. For example, when summing all elements of a 400K-element vector, we could break it up into 4 chunks of 100K elements, sum those in parallel, then combine the 4 subtotals into the final answer. In the ideal case this would be close to 4x faster than using only a single thread (a single CPU core); in practice the overhead of splitting and combining eats into that.
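A minimal sketch of that split-and-combine idea, done by hand with futures instead of the reducers library. The names `data` and `chunked-sum` are hypothetical, and the chunking is deliberately naive:

```clojure
;; Hypothetical example data: a 400K-element vector.
(def data (vec (range 400000)))

(defn chunked-sum
  "Sums v by splitting it into roughly n-chunks pieces, summing each
  piece on its own thread, then adding the subtotals."
  [v n-chunks]
  (let [size    (max 1 (quot (count v) n-chunks))
        chunks  (partition-all size v)
        ;; mapv is eager, so every future is started before we block on any
        futures (mapv #(future (reduce + %)) chunks)]
    (reduce + (map deref futures))))

(chunked-sum data 4) ; same value as (reduce + data)
```

r/fold does this kind of splitting for you, using a fork/join pool and the internal structure of the collection rather than `partition-all`.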
Reducers live in the clojure.core.reducers namespace. Assume we define aliases like:
(ns demo.xyz
  (:require [clojure.core :as core]
            [clojure.core.reducers :as r]))
Compared to clojure.core, we have:
core/reduce <=> r/fold   ; different name: the parallel work-alike of `reduce`
core/map    <=> r/map    ; same name as `map`
core/filter <=> r/filter ; same name as `filter`
So, the naming is not the best. The parallel counterpart of clojure.core/reduce is not called reduce; it is a work-alike function named fold in clojure.core.reducers. (That namespace does have its own reduce, but it is sequential, just like the one in clojure.core.)
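As a rough illustration of the correspondence, here is the same pipeline written once with clojure.core and once with the reducers functions. The var names are made up for the example:

```clojure
(require '[clojure.core.reducers :as r])

(def v (vec (range 1000)))

;; Sequential pipeline from clojure.core (builds lazy seqs along the way).
(def core-result (reduce + (filter even? (map inc v))))

;; r/map and r/filter only describe the transformation; r/fold runs it,
;; potentially in parallel because the source is a vector.
(def fold-result (r/fold + (r/filter even? (r/map inc v))))

(= core-result fold-result) ; => true
```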
Note that fold is a historical name for combining lists of data as with our summation example. See the Wikipedia entry for more information.
Because folding accesses the data in non-linear order (which is very inefficient for linked lists), folding is only worth doing on random-access data structures like vectors.
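A small sketch of what this means in practice: r/fold accepts both a vector and a list and returns the same answer, but only the vector can be split into chunks; for the list it simply falls back to an ordinary sequential reduce. The var names are hypothetical:

```clojure
(require '[clojure.core.reducers :as r])

(def as-vector (vec (range 100000)))
(def as-list   (apply list (range 100000)))

(r/fold + as-vector) ; foldable: may be split and summed in parallel
(r/fold + as-list)   ; not foldable: falls back to a sequential reduce
```

So using r/fold on a list is not an error, it just buys you nothing.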
Update #1:
Having said the above, remember the adage that "Premature optimization is the root of all evil." Here are some measurements for (vec (range 1e7)), i.e. 10M entries, on an 8-core machine:
(time (reduce + data))
"Elapsed time: 284.52735 msecs"
"Elapsed time: 119.310289 msecs"
"Elapsed time: 98.740421 msecs"
"Elapsed time: 100.58998 msecs"
"Elapsed time: 98.642878 msecs"
"Elapsed time: 105.021808 msecs"
"Elapsed time: 99.886083 msecs"
"Elapsed time: 98.49152 msecs"
"Elapsed time: 99.879767 msecs"
(time (r/fold + data))
"Elapsed time: 61.67537 msecs"
"Elapsed time: 56.811961 msecs"
"Elapsed time: 55.613058 msecs"
"Elapsed time: 58.359599 msecs"
"Elapsed time: 55.299767 msecs"
"Elapsed time: 62.989939 msecs"
"Elapsed time: 56.518486 msecs"
"Elapsed time: 54.218251 msecs"
"Elapsed time: 54.438623 msecs"
Criterium reports:
reduce 144 ms
r/fold 72 ms
Update #2
Rich Hickey talked about the design of transducers/reducers at the 2014 Clojure Conj. You may find these details useful. The basic idea is that the folding is delegated to each collection type, which uses knowledge of its implementation details to perform the fold efficiently.
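Since hash maps are stored internally as a tree of arrays (a hash array mapped trie, similar in spirit to the tree behind vectors), they can also be split up and folded in parallel efficiently.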
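A hedged sketch of folding over a hash map. When the collection is a map, the reducing function receives the accumulator, the key, and the value as three separate arguments (the same shape as reduce-kv). The names `m` and `sum-vals` are hypothetical:

```clojure
(require '[clojure.core.reducers :as r])

;; A map with 10000 entries, large enough to be a hash map.
(def m (zipmap (range 10000) (range 10000)))

(defn sum-vals [a-map]
  (r/fold +                          ; combinef: (+) => 0, (+ a b) merges subtotals
          (fn [acc _k v] (+ acc v))  ; reducef: gets accumulator, key, value
          a-map))

(sum-vals m) ; same value as (reduce + (vals m))
```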
There is this talk by Guy Steele which predates reducers and might just have served as an inspiration for them.
https://vimeo.com/6624203
Related
If I get a sorted list of objects from an external API, is there a way to put it in a sorted set without the overhead of re-sorting it? Something like:
=> (sorted? (assume-sorted [1 2 3]))
true
Clojure uses a persistent red-black tree data structure for sorted sets and maps. When an inserted item makes the tree too unbalanced, the root and nodes of the tree are rearranged to keep it "approximately" balanced.
What your measurement shows is that there is slightly more overhead in rebalancing a tree that only grows on the right (every new addition unbalances the tree further to the right) compared to a tree that grows in random locations (some insertions will randomly make the tree more balanced).
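For reference, a small sketch of the ordinary way to build a sorted set, which goes through the tree insertion described above whatever order the input arrives in. The var names are hypothetical:

```clojure
(def xs (range 1000))

;; Both sets are built by inserting one element at a time into the tree.
(def from-sorted   (into (sorted-set) xs))
(def from-shuffled (into (sorted-set) (shuffle xs)))

(= from-sorted from-shuffled) ; => true, input order does not affect the result
(sorted? from-sorted)         ; => true
(sorted? (vec xs))            ; => false, a vector is not a sorted collection
```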
See:
https://en.wikipedia.org/wiki/Red%E2%80%93black_tree
https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/PersistentTreeMap.java
Update
I just tried on my computer and get very different results than your test. This once again shows the folly of trying to optimize prematurely (especially if the change is less than 2x):
(def x (range 1000000))
(def y (doall (shuffle x)))
parse.core=> (time (doall (set x) nil))
"Elapsed time: 279.259531 msecs"
"Elapsed time: 291.31022 msecs"
"Elapsed time: 414.752484 msecs"
parse.core=> (time (doall (set y) nil))
"Elapsed time: 286.496324 msecs"
"Elapsed time: 284.95049 msecs"
"Elapsed time: 285.568222 msecs"
"Elapsed time: 301.55659 msecs"
parse.core=> (time (doall (into (sorted-set) x) nil))
"Elapsed time: 816.473169 msecs"
"Elapsed time: 775.025901 msecs"
"Elapsed time: 763.638447 msecs"
parse.core=> (time (doall (into (sorted-set) y) nil))
"Elapsed time: 1307.969889 msecs"
"Elapsed time: 1313.099123 msecs"
"Elapsed time: 1285.665797 msecs"
"Elapsed time: 1307.879676 msecs"
The Moral of the Story
First make it right
If it is fast enough, move on to the next problem
If it needs to be faster, measure where the biggest bottleneck is
Decide if it's cheaper to just use more h/w at $0.03/hr or to spend human time on code changes (which will increase complexity & reduce maintainability etc).
I copied the code from:
http://clojure.org/transients
but my results differ significantly from what was posted.
(defn vrange [n]
  (loop [i 0 v []]
    (if (< i n)
      (recur (inc i) (conj v i))
      v)))
(defn vrange2 [n]
  (loop [i 0 v (transient [])]
    (if (< i n)
      (recur (inc i) (conj! v i))
      (persistent! v))))
(quick-bench (def v (vrange 1000000)))
"Elapsed time: 459.59 msecs"
(quick-bench (def v2 (vrange2 1000000)))
"Elapsed time: 379.85 msecs"
That's a slight speedup, but nothing like the 8x boost implied by the example in the docs.
Starting Java in server mode changes the story, but the result is still nothing like the docs:
(quick-bench (def v (vrange 1000000)))
"Elapsed time: 121.14 msecs"
(quick-bench (def v2 (vrange2 1000000)))
"Elapsed time: 75.15 msecs"
Is it that the persistent implementations have improved since the post about transients here: http://clojure.org/transients ?
What other factors might be contributing to the lack of boost with transients?
I'm using the OpenJDK java version 1.7 on ubuntu 12.04. Maybe that's a lot slower than the (presumed) Hotspot 1.6 version used in the docs? But wouldn't that imply BOTH tests should be slower by some constant, with the same gap?
Your results are consistent with my experience with transients. I've used them quite a bit and I typically see a 2x performance improvement.
I tried this on Ubuntu 12.04, OpenJDK 1.7 with Clojure 1.6.0 and 1.7.0-alpha3. I get 2x performance with transients, slightly less than the 3x I get on OSX with the 1.8 Oracle jvm.
Also the above page is from the time of Clojure 1.2, and the performance of collections has improved significantly since then. I tried the experiment with 1.2 but Criterium doesn't work with it, so I had to use time just like on that page. Obviously the results are significantly variable (from 2x to 8x). I suspect the example in the documentation may have been cherry-picked.
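As a side note, you rarely need to write the transient loop by hand. A minimal sketch, with hypothetical function names, of two shorter ways to get the same vector; `into` already uses transients internally when the target collection supports them:

```clojure
;; Explicit transient, but with reduce instead of loop/recur.
(defn vrange-reduce [n]
  (persistent! (reduce conj! (transient []) (range n))))

;; into uses a transient under the hood for vectors.
(defn vrange-into [n]
  (into [] (range n)))

(= (vrange-reduce 1000) (vrange-into 1000) (vec (range 1000))) ; => true
```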
(require '[clojure.core.reducers :as r])
(def data (into [] (take 10000000 (repeatedly #(rand-int 1000)))))
(defn frequencies [coll]
  (reduce (fn [counts x]
            (merge-with + counts {x 1}))
          {} coll))
(defn pfrequencies [coll]
  (r/reduce (fn [counts x]
              (merge-with + counts {x 1}))
            {} coll))
user=> (time (do (frequencies data) nil))
"Elapsed time: 29697.183 msecs"
user=> (time (do (pfrequencies data) nil))
"Elapsed time: 25273.794 msecs"
user=> (time (do (frequencies data) nil))
"Elapsed time: 25384.086 msecs"
user=> (time (do (pfrequencies data) nil))
"Elapsed time: 25778.502 msecs"
And who can show me an example with significant speedup?
I'm running on Mac OSX 10.7.5 with Java 1.7 on an Intel Core i7 (2 cores, http://ark.intel.com/products/54617).
You called it pfrequencies, which, along with your parallel-processing tag on the question, suggests you think that something is using multiple threads here. That is not the case, and neither is it the "main" goal of the reducers library.
The main thing reducers buy you is that you don't need to allocate many intermediate cons cells for your lazy sequences. Before reducers were introduced, frequencies would allocate 10000000 cons cells to create a sequential view of the vector for reduce to use. Now that reducers exist, vectors know how to reduce themselves without creating such temporary objects. But that feature has been backported into clojure.core/reduce, which behaves exactly like r/reduce (ignoring some minor features that are irrelevant here). So you are just benchmarking your function against an identical clone of itself.
The reducers library also includes the notion of a fold, which can do some work in parallel, and then later merge together the intermediate results. To use this, you need to provide more information than reduce needs: you must define how to start a "chunk" from nothing; your function must be associative; and you must specify how to combine chunks. A. Webb's answer demonstrates how to use fold correctly, to get work done on multiple threads.
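To make those three ingredients concrete, here is a minimal sketch on a simpler problem than frequencies: counting the even numbers in a vector. The names `nums` and `count-evens` are hypothetical:

```clojure
(require '[clojure.core.reducers :as r])

(def nums (vec (range 100000)))

(defn count-evens [coll]
  (r/fold
    512                                        ; chunk size (512 is the default)
    +                                          ; combinef: (+) => 0 starts a chunk,
                                               ;           (+ a b) merges two chunks
    (fn [acc x] (if (even? x) (inc acc) acc))  ; reducef: handles one element
    coll))

(count-evens nums) ; => 50000
```

Here the combine step is a single addition, so it costs almost nothing compared to the work done inside each chunk. That is the situation in which fold pays off.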
However, you're unlikely to get any benefit from folding: in addition to the reason he notes (you give up transients, compared to clojure.core/frequencies), building a map is not easily parallelizable. If the bulk of the work in frequencies were addition (as it would be in something like (frequencies (repeat 1e6 1))), then fold would help; but most of the work is in managing the keys in the hashmap, which really has to be single-threaded eventually. You can build maps in parallel, but then you have to merge them together; since that combination step takes time proportional to the size of the chunk, rather than constant time, you gain little by doing the chunks on a separate thread anyway.
A fold version of your frequencies function would look something like
(defn pfrequencies [coll]
  (r/fold
    (fn combinef
      ([] {})
      ([x y] (merge-with + x y)))
    (fn reducef
      ([counts x] (merge-with + counts {x 1})))
    coll))
On 2 cores, it will likely be much slower than clojure.core/frequencies which uses transients. At least on 4 cores, it is faster (2x) than the first implementation, but still slower than clojure.core/frequencies.
You might also experiment with
(defn p2frequencies [coll]
(apply merge-with + (pmap clojure.core/frequencies (partition-all 512 coll))))
Some serious food for thought in the answers here. In this specific case maps should not be needed, since the result domain can be easily predicted and put in a vector where the index can be used. So, a naive implementation of a naive problem would be something like:
(defn freqs
  [coll]
  (reduce (fn [counts x] (assoc counts x (inc (get counts x))))
          (vec (int-array 1000 0))
          coll))
(defn rfreqs
  [coll]
  (r/fold
    (fn combinef
      ([] (vec (int-array 1000 0)))
      ([& cols] (apply mapv + cols)))
    (fn reducef
      [counts x] (assoc counts x (inc (get counts x))))
    coll))
Here the combinef would be a simple map addition over the 1000 columns of the resulting collections, which should be negligible.
This gives the reducer version a speedup of about 2-3x over the normal version, especially on bigger (10x-100x) datasets. The partition size of r/fold (the optional 'n' parameter) can be tweaked for fine-tuning. A value of (* 16 1024) seemed optimal with a data size of 1e8 (this needs a JVM heap of at least 6 GB).
You could even use transients in both versions, but I didn't notice much improvement.
I know this version isn't appropriate for generic usage, but it might show the speed improvement without the hash management overhead.
I am trying to figure out how to use clojure to efficiently apply a simple operation to a large sequence in parallel. I would like to be able to use the parallel solution to take advantage of the multiple cores on my machine to achieve some speedup.
I am attempting to use pmap in combination with partition-all to reduce the overhead of creating a future for every item in the input seq. Unfortunately, partition-all forces the complete evaluation of each partition seq. This causes an OutOfMemoryError on my machine.
(defn sum [vs]
  (reduce + vs))
(def workers
  (+ 2 (.. Runtime getRuntime availableProcessors)))
(let [n  80000000
      vs (range n)]
  (time (sum vs))
  (time (sum (pmap sum (partition-all (long (/ n workers)) vs)))))
How can I apply sum to a large input set, and beat the performance of the serial implementation?
Solution
Thanks to @Arthur Ulfeldt for pointing out the reducers library. Here is the solution using reducers. This code shows the expected performance improvement when running on a multi-core machine. (NOTE: I have changed vs to be a function to make the timing more accurate.)
(require '[clojure.core.reducers :as r])
(let [n  80000000
      vs #(range n)]
  (time (reduce + (vs)))
  (time (r/fold + (vs))))
When using pmap I have found that fairly large chunks are required to overcome the thread-switching and future overhead; try a chunk size of 10,000 for an operation as fast as +. The potential gains are bounded by the overhead of generating the chunks, so there is an optimal chunk size that balances the available cores against the time required to make the chunks. In this case, with + as the workload, I was unable to make this faster than the single-threaded option.
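A minimal sketch of that chunked pmap approach. The function name `chunked-psum` is hypothetical, and 10,000 is just the chunk size suggested above, not a tuned value:

```clojure
(defn chunked-psum
  "Sums coll by reducing fixed-size chunks in parallel with pmap,
  then adding up the per-chunk subtotals."
  [coll chunk-size]
  (reduce + (pmap #(reduce + %) (partition-all chunk-size coll))))

(chunked-psum (range 1000000) 10000) ; same value as (reduce + (range 1000000))
```

Note that partition-all still walks the input on a single thread, which is part of the chunk-generation overhead mentioned above.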
If you're interested in doing this without pmap, and potentially using fork/join, check out the new(ish) reducers library.
The OOM situation comes from the first test realizing the lazy sequence from (range n) which is then retained so it can be passed to the second sequence.
If I make the + function much slower by defining a slow+ function and use that instead, the difference between a single thread, pmap over chunks, and reducers with fork/join becomes visible:
user> *clojure-version*
{:major 1, :minor 5, :incremental 0, :qualifier "RC15"}
(require '[clojure.core.reducers :as r])
(def workers
  (+ 2 (.. Runtime getRuntime availableProcessors)))
(defn slow+
  ([] 0)
  ([x] x)
  ([x y] (reduce + (range 100000)) (+ x y)))
(defn run-test []
  (let [n 8000]
    (time (reduce slow+ (range n)))
    (time (reduce slow+ (pmap #(reduce slow+ %) (partition-all (* workers 100) (range n)))))
    (time (r/fold slow+ (vec (range n))))))
user> (run-test)
"Elapsed time: 28655.951241 msecs" ; one thread
"Elapsed time: 6975.488591 msecs" ; pmap over chunks
"Elapsed time: 8170.254426 msecs" ; using reducer
I have a Clojure program which is using some large maps (1000 - 2000 items) which are accessed 30 - 40 times a second and using Strings as the keys. I was wondering if there is a big performance difference if I used keywords or symbols as the keys instead?
Clojure map lookups are very fast, and do not particularly depend on the size of the map.
In fact, they are almost as fast as pure Java HashMaps, while enjoying many advantages over traditional HashMaps including being immutable and thread-safe.
If you are only doing 30-40 lookups a second then I guarantee you will never notice the difference regardless of what you use as keys. Worrying about this would count as premature optimisation.
Let's prove it: the following code does a million map lookups using strings as keys:
(def str-keys (map str (range 1000)))
(def m (zipmap str-keys (range 1000)))
(time (dotimes [i 1000] (doseq [k str-keys] (m k))))
=> "Elapsed time: 69.082224 msecs"
The following does a million map lookups using keywords as keys:
(def kw-keys (map #(keyword (str %)) (range 1000)))
(def m (zipmap kw-keys (range 1000)))
(time (dotimes [i 1000] (doseq [k kw-keys] (m k))))
=> "Elapsed time: 59.212864 msecs"
And for symbols:
(def sym-keys (map #(symbol (str %)) (range 1000)))
(def m (zipmap sym-keys (range 1000)))
(time (dotimes [i 1000] (doseq [k sym-keys] (m k))))
=> "Elapsed time: 61.590925 msecs"
In my tests, Symbols and Keywords were slightly faster than Strings, but still the difference could easily be explained by statistical error, and the average execution time per lookup was less than 100 nanoseconds for all cases.
So your 30-40 lookups a second are probably taking on the order of 0.001% of your CPU time (this even allows for the fact that in a real app, lookups will probably be a few times slower due to caching issues).
The likely reason for Keywords in particular being slightly faster is that they are interned (and can therefore use reference equality to check for equality). But as you can see the difference is sufficiently small that you really don't need to worry about it.
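A small sketch of what interning means here, plus the lookup forms available for each key type. The map `example-map` is made up for illustration:

```clojure
;; A keyword built at runtime is the very same object as the literal.
(identical? :a (keyword "a")) ; => true

;; Strings compare equal by value; that needs an equals() call rather
;; than relying on a pointer comparison.
(= "ab" (str "a" "b"))        ; => true

(def example-map {:a 1, "a" 2, 'a 3})

(example-map :a)       ; => 1, map used as a function
(:a example-map)       ; => 1, keyword used as a function
(get example-map "a")  ; => 2, strings need get or the map-as-function form
(example-map 'a)       ; => 3
```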