Clojure Transients Example - No significant speedup

I copied the code from
http://clojure.org/transients
but my results differ significantly from what was posted:
(defn vrange [n]
  (loop [i 0 v []]
    (if (< i n)
      (recur (inc i) (conj v i))
      v)))

(defn vrange2 [n]
  (loop [i 0 v (transient [])]
    (if (< i n)
      (recur (inc i) (conj! v i))
      (persistent! v))))
(quick-bench (def v (vrange 1000000)))
"Elapsed time: 459.59 msecs"
(quick-bench (def v2 (vrange2 1000000)))
"Elapsed time: 379.85 msecs"
That's a slight speedup, but nothing like the 8x boost implied in the example docs.
Starting Java in server mode changes the story, but there is still nothing like the gap shown in the docs:
(quick-bench (def v (vrange 1000000)))
"Elapsed time: 121.14 msecs"
(quick-bench (def v2 (vrange2 1000000)))
"Elapsed time: 75.15 msecs"
Is it that the persistent implementations have improved since the post about transients at http://clojure.org/transients was written?
What other factors might be contributing to the lack of boost with transients?
I'm using OpenJDK Java 1.7 on Ubuntu 12.04. Maybe that's a lot slower than the (presumed) HotSpot 1.6 used for the docs? But wouldn't that make BOTH tests slower by roughly the same factor, leaving the same relative gap?

Your results are consistent with my experience with transients. I've used them quite a bit and I typically see a 2x performance improvement.
I tried this on Ubuntu 12.04, OpenJDK 1.7 with Clojure 1.6.0 and 1.7.0-alpha3. I get 2x performance with transients, slightly less than the 3x I get on OSX with the 1.8 Oracle jvm.
Also, the above page dates from the Clojure 1.2 era, and the performance of the collections has improved significantly since then. I tried the experiment with 1.2, but Criterium doesn't work with it, so I had to use time just as on that page. The results are, obviously, quite variable (from 2x to 8x), and I suspect the example in the documentation may have been cherry-picked.
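As an aside, wrapping a def in the benchmark measures var interning as well as the work itself. Here is a minimal sketch of how I would run the comparison, benchmarking just the expressions (assuming Criterium is on the classpath and the two functions above are defined):

(require '[criterium.core :refer [quick-bench]])

;; Benchmark the expressions themselves rather than (def ...) forms,
;; so that only vector construction is measured.
(quick-bench (vrange 1000000))   ; persistent conj
(quick-bench (vrange2 1000000))  ; transient conj!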

Related

OutOfMemoryError: GC overhead limit exceeded with Criterium

I am trying to benchmark an expression using the Criterium library. The expression is
(vec (range 10000000))
To benchmark it, I type
(criterium.core/bench (vec (range 10000000)))
and after a while I get
OutOfMemoryError GC overhead limit exceeded java.lang.Long.valueOf (Long.java:840)
As I have seen here, this means that the maximum size of the heap (1 GB) is not enough for the data to fit, and the garbage collector tries to free space but is unable to do so. However, microbenchmarking the expression as below doesn't produce this error:
(dotimes [i 60] (time (vec (range 10000000))))
By the way, I set it to 60 iterations because I have seen here that the bench macro does 60 executions by default.
The question is why is this happening when using Criterium.
Edit: When starting a fresh REPL, the code below
{:max (.maxMemory (Runtime/getRuntime)), :total (.totalMemory (Runtime/getRuntime))}
outputs
{:max 922746880, :total 212860928}
After I run either (dotimes [i 60] (time (vec (range 10000000)))) or (criterium.core/bench (vec (range 10000000))),
it outputs
{:max 922746880, :total 922746880}
I was able to reproduce the behavior by using this test:
;; project.clj
:profiles {:test {:jvm-opts ["-Xms1024m" "-Xmx1024m"]}}

;; test namespace
(:require [clojure.test :refer :all]
          [criterium.core :as ben])

(deftest ^:focused ben-test
  (is (ben/with-progress-reporting
        (ben/bench (vec (range 10000000))))))
The output and stack trace look like this:
Estimating sampling overhead
Warming up for JIT optimisations 10000000000 ...
compilation occurred before 377618 iterations
...
Estimating execution count ...
Sampling ...
Final GC...
Checking GC...
Finding outliers ...
Bootstrapping ...
Checking outlier significance
Warming up for JIT optimisations 10000000000 ...
compilation occurred before 1 iterations
criterium.core$execute_expr_core_timed_part$fn__40395.invoke (core.clj:370)
criterium.core$execute_expr_core_timed_part.invokeStatic (core.clj:366)
criterium.core$execute_expr_core_timed_part.invoke (core.clj:345)
criterium.core$execute_expr.invokeStatic (core.clj:378)
criterium.core$execute_expr.invoke (core.clj:374)
criterium.core$warmup_for_jit.invokeStatic (core.clj:428)
criterium.core$warmup_for_jit.invoke (core.clj:396)
criterium.core$run_benchmark.invokeStatic (core.clj:479)
criterium.core$run_benchmark.invoke (core.clj:470)
criterium.core$benchmark_STAR_.invokeStatic (core.clj:826)
criterium.core$benchmark_STAR_.invoke (core.clj:812)
We can see here that the error occurs in the JIT warm-up step. The interesting point is the function execute-expr-core-timed-part (core.clj:345). This function performs the expression (vec (range 10000000)) n times and each time saves the returned value into a so-called mutable place. My hypothesis is that the memory leak is here:
(time-body
 (loop [i (long (dec n))
        v (f)]
   (set-place mutable-place v) ;; <== the result is retained here
   (if (pos? i)
     (recur (unchecked-dec i) (f))
     v)))
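A minimal repro of that hypothesis (my own sketch, not Criterium's code): keeping a reference to the previous result while the next one is built means two ~10M-element vectors are live at once, which is enough to trip the GC overhead limit on a 1 GB heap.

(def result-holder (atom nil))

(dotimes [_ 60]
  ;; While the next (vec (range 10000000)) is being built, the previous
  ;; vector is still reachable through result-holder, so both are live.
  (reset! result-holder (vec (range 10000000))))

If that is right, simply giving the JVM more headroom should let the benchmark complete, e.g. with Leiningen (heap size hypothetical, adjust to your machine):

;; project.clj
:jvm-opts ["-Xmx4g"]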

Why does `doall` not force the sequence to be counted?

(counted? (map identity (range 100))) ;; false, expected
(time (counted? (doall (map identity (range 100))))) ;; false, unexpected
(time (counted? (into '() (map identity (range 100))))) ;; true, expected - but slower
(Clojure "1.8.0")
The first result is expected since map is lazy.
The second is unexpected to me, since after doall the entire sequence has been realized and is now in memory. Since the implementation has to walk through the list anyway, why not count it?
The third is a workaround. Is it idiomatic? Is there an alternative?
It sounds like you already know that lazy sequences do not satisfy counted?.
However, in your example, whilst doall realizes the entire sequence, it still returns that result as a LazySeq. Have a look at this REPL output:
user=> (class (doall (map identity (range 100))))
clojure.lang.LazySeq
Using something like into seems to me to be the right way to go, because you need to force your result into a non-lazy collection. You say into is slower, but it still seems acceptably fast.
Nevertheless, you could perhaps improve the time performance by calling vec on your result, instead of into:
user=> (time (counted? (into '() (map identity (range 100)))))
"Elapsed time: 0.287542 msecs"
true
user=> (time (counted? (vec (map identity (range 100)))))
"Elapsed time: 0.169342 msecs"
true
Note: I'm using Clojure 1.9, rather than 1.8 on my machine, so you may see different results.
Update / corrections:
Commenters have respectfully pointed out that:
1) time is a poor tool for benchmarking and doesn't really provide any useful evidence in this instance.
2) (vec x) could be substituted with (list x); (list x) is a constant-time operation no matter what the contents of x are.
3) doall returns its input as its output: you get a LazySeq back if you passed in a LazySeq, a map if you passed in a map, and so on.
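A quick REPL check of point 3:

user=> (class (doall [1 2 3]))   ; a vector comes back as a vector
clojure.lang.PersistentVector
user=> (class (doall (map identity (range 3))))
clojure.lang.LazySeq             ; a lazy seq comes back as a LazySeq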

Why is there no significant speedup using reducers in this example?

(require '[clojure.core.reducers :as r])

(def data (into [] (take 10000000 (repeatedly #(rand-int 1000)))))

(defn frequencies [coll]
  (reduce (fn [counts x]
            (merge-with + counts {x 1}))
          {} coll))

(defn pfrequencies [coll]
  (r/reduce (fn [counts x]
              (merge-with + counts {x 1}))
            {} coll))
user=> (time (do (frequencies data) nil))
"Elapsed time: 29697.183 msecs"
user=> (time (do (pfrequencies data) nil))
"Elapsed time: 25273.794 msecs"
user=> (time (do (frequencies data) nil))
"Elapsed time: 25384.086 msecs"
user=> (time (do (pfrequencies data) nil))
"Elapsed time: 25778.502 msecs"
Can anyone show me an example with a significant speedup?
I'm running on Mac OSX 10.7.5 with Java 1.7 on an Intel Core i7 (2 cores, http://ark.intel.com/products/54617).
You called it pfrequencies, which, along with your parallel-processing tag on the question, suggests you think that something is using multiple threads here. That is not the case, and neither is it the "main" goal of the reducers library.
The main thing reducers buy you is that you don't need to allocate many intermediate cons cells for your lazy sequences. Before reducers were introduced, frequencies would allocate 10000000 cons cells to create a sequential view of the vector for reduce to use. Now that reducers exist, vectors know how to reduce themselves without creating such temporary objects. But that feature has been backported into clojure.core/reduce, which behaves exactly like r/reduce (ignoring some minor features that are irrelevant here). So you are just benchmarking your function against an identical clone of itself.
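To see this, a sketch (my own illustration, not a rigorous benchmark): on a vector, both calls bottom out in the vector reducing itself, so they exercise the same code path.

(require '[clojure.core.reducers :as r])

(def v (vec (range 1000000)))
;; No intermediate seq of cons cells is allocated in either case.
(reduce + v)
(r/reduce + v)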
The reducers library also includes the notion of a fold, which can do some work in parallel, and then later merge together the intermediate results. To use this, you need to provide more information than reduce needs: you must define how to start a "chunk" from nothing; your function must be associative; and you must specify how to combine chunks. A. Webb's answer demonstrates how to use fold correctly, to get work done on multiple threads.
However, you're unlikely to get any benefit from folding: in addition to the reason he notes (you give up transients, compared to clojure.core/frequencies), building a map is not easily parallelizable. If the bulk of the work in frequencies were addition (as it would be in something like (frequencies (repeat 1e6 1))), then fold would help; but most of the work is in managing the keys in the hashmap, which really has to be single-threaded eventually. You can build maps in parallel, but then you have to merge them together; since that combination step takes time proportional to the size of the chunk, rather than constant time, you gain little by doing the chunks on a separate thread anyway.
A fold version of your frequencies function would look something like
(defn pfrequencies [coll]
  (r/fold
   (fn combinef
     ([] {})
     ([x y] (merge-with + x y)))
   (fn reducef
     ([counts x] (merge-with + counts {x 1})))
   coll))
On 2 cores, it will likely be much slower than clojure.core/frequencies which uses transients. At least on 4 cores, it is faster (2x) than the first implementation, but still slower than clojure.core/frequencies.
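For reference, clojure.core/frequencies is essentially a transient-backed reduce, along these lines (a sketch from memory, not the verbatim core source):

(defn core-frequencies [coll]
  (persistent!
   (reduce (fn [counts x]
             ;; assoc! on a transient map avoids allocating a new map
             ;; per element, unlike merge-with on persistent maps
             (assoc! counts x (inc (get counts x 0))))
           (transient {})
           coll)))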
You might also experiment with
(defn p2frequencies [coll]
  (apply merge-with + (pmap clojure.core/frequencies (partition-all 512 coll))))
There is some serious food for thought in the answers here. In this specific case, maps should not be needed at all, since the result domain can be predicted in advance and represented as a vector indexed by the value. So a naive implementation of this naive problem would be something like:
(defn freqs
  [coll]
  (reduce (fn [counts x] (assoc counts x (inc (get counts x))))
          (vec (int-array 1000 0))
          coll))

(defn rfreqs
  [coll]
  (r/fold
   (fn combinef
     ([] (vec (int-array 1000 0)))
     ([& cols] (apply mapv + cols)))
   (fn reducef
     [counts x] (assoc counts x (inc (get counts x))))
   coll))
Here the combinef is a simple element-wise addition over the 1000 slots of the resulting vectors, which should be negligible.
This gives the reducer version a speedup of about 2-3x over the normal version, especially on bigger (10x-100x) datasets. Some twiddling with the partition size of r/fold (the optional n parameter) can be done as fine-tuning; (* 16 1024) seemed optimal with a data size of 1E8 (which needs at least a 6 GB JVM heap).
You could even use transients in both versions, but I didn't notice much improvement.
I know this version isn't appropriate for generic usage, but it might show the speed improvement without the hash management overhead.
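A hypothetical sanity check (my own, assuming the values really are ints in [0, 1000) as above):

(def data (vec (repeatedly 1000000 #(rand-int 1000))))

;; Both versions should agree; only their running time differs.
(= (freqs data) (rfreqs data)) ;; => true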

How can I compute the sum of a large list of numbers in parallel using Clojure

I am trying to figure out how to use Clojure to efficiently apply a simple operation to a large sequence in parallel. I would like to be able to use the parallel solution to take advantage of the multiple cores on my machine to achieve some speedup.
I am attempting to use pmap in combination with partition-all to reduce the overhead of creating a future for every item in the input seq. Unfortunately, partition-all forces the complete evaluation of each partition seq. This causes an OutOfMemoryError on my machine.
(defn sum [vs]
  (reduce + vs))

(def workers
  (+ 2 (.. Runtime getRuntime availableProcessors)))

(let [n 80000000
      vs (range n)]
  (time (sum vs))
  (time (sum (pmap sum (partition-all (long (/ n workers)) vs)))))
How can I apply sum to a large input set, and beat the performance of the serial implementation?
Solution
Thanks to Arthur Ulfeldt for pointing out the reducers library. Here is the solution using reducers. This code shows the expected performance improvement when run on a multi-core machine. (NOTE: I have changed vs to be a function so that the timing is more accurate.)
(require '[clojure.core.reducers :as r])

(let [n 80000000
      vs #(range n)]
  (time (reduce + (vs)))
  (time (r/fold + (vs))))
When using pmap I have found that fairly large chunks are required to overcome the task-switching and future overhead; try a chunk size of 10,000 for an operation as fast as +. The potential gains are bounded by the overhead of generating the chunks, so there is an optimal chunk size that balances the available cores against the time required to make the chunks. In this case, with + as the workload, I was unable to make this faster than the single-threaded option.
If you're interested in doing this without pmap, and potentially using fork/join, check out the new(ish) reducers library.
The OOM situation comes from the first test realizing the lazy sequence from (range n), which is then retained so it can be passed to the second test.
If I make the addition much slower by defining a slow+ function and use that, the difference between a single thread, pmap over chunks, and reducers with fork/join becomes visible:
user> *clojure-version*
{:major 1, :minor 5, :incremental 0, :qualifier "RC15"}
(require '[clojure.core.reducers :as r])

(def workers
  (+ 2 (.. Runtime getRuntime availableProcessors)))

(defn slow+
  ([] 0)
  ([x] x)
  ([x y] (reduce + (range 100000)) (+ x y)))

(defn run-test []
  (let [n 8000]
    (time (reduce slow+ (range n)))
    (time (reduce slow+ (pmap #(reduce slow+ %) (partition-all (* workers 100) (range n)))))
    (time (r/fold slow+ (vec (range n))))))
user> (run-test)
"Elapsed time: 28655.951241 msecs" ; one thread
"Elapsed time: 6975.488591 msecs" ; pmap over chunks
"Elapsed time: 8170.254426 msecs" ; using reducer

Performance of large maps in Clojure

I have a Clojure program which uses some large maps (1000 - 2000 items) that are accessed 30 - 40 times a second, using Strings as the keys. I was wondering if there would be a big performance difference if I used keywords or symbols as the keys instead?
Clojure map lookups are very fast, and do not particularly depend on the size of the map.
In fact, they are almost as fast as pure Java HashMaps, while enjoying many advantages over traditional HashMaps including being immutable and thread-safe.
If you are only doing 30-40 lookups a second then I guarantee you will never notice the difference regardless of what you use as keys. Worrying about this would count as premature optimisation.
Let's prove it: the following code does a million map lookups using strings as keys:
(def str-keys (map str (range 1000)))
(def m (zipmap str-keys (range 1000)))
(time (dotimes [i 1000] (doseq [k str-keys] (m k))))
=> "Elapsed time: 69.082224 msecs"
The following does a million map lookups using keywords as keys:
(def kw-keys (map #(keyword (str %)) (range 1000)))
(def m (zipmap kw-keys (range 1000)))
(time (dotimes [i 1000] (doseq [k kw-keys] (m k))))
=> "Elapsed time: 59.212864 msecs"
And for symbols:
(def sym-keys (map #(symbol (str %)) (range 1000)))
(def m (zipmap sym-keys (range 1000)))
(time (dotimes [i 1000] (doseq [k sym-keys] (m k))))
=> "Elapsed time: 61.590925 msecs"
In my tests, Symbols and Keywords were slightly faster than Strings, but the difference could easily be explained by statistical error, and the average execution time per lookup was less than 100 nanoseconds in all cases.
So your 30-40 lookups are probably taking on the order of 0.001% of your CPU time (this even allows for the fact that in a real app, lookups will probably be a few times slower due to caching issues).
The likely reason for Keywords in particular being slightly faster is that they are interned (and can therefore use reference equality to check for equality). But as you can see the difference is sufficiently small that you really don't need to worry about it.
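A small REPL sketch of that interning point (my own illustration):

;; Equal keywords are the same object, so equality can short-circuit
;; on an identity check.
user=> (identical? :foo :foo)
true
;; Equal strings need not be the same object, so = must compare contents.
user=> (identical? "foo" (String. "foo"))
false
user=> (= "foo" (String. "foo"))
true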