I came across this while tuning some performance-sensitive code:
user> (use 'criterium.core)
nil
user> (def n (into {} (for [i (range 20000) :let [k (keyword (str i))]] [k {k k}])))
#'user/n
user> (quick-bench (-> n :1 :1))
WARNING: Final GC required 32.5115186521176 % of runtime
Evaluation count : 15509754 in 6 samples of 2584959 calls.
Execution time mean : 36.256135 ns
Execution time std-deviation : 1.076403 ns
Execution time lower quantile : 35.120871 ns ( 2.5%)
Execution time upper quantile : 37.470993 ns (97.5%)
Overhead used : 1.755171 ns
nil
user> (quick-bench (get-in n [:1 :1]))
WARNING: Final GC required 33.11057826481865 % of runtime
Evaluation count : 7681728 in 6 samples of 1280288 calls.
Execution time mean : 81.023429 ns
Execution time std-deviation : 3.244516 ns
Execution time lower quantile : 78.220643 ns ( 2.5%)
Execution time upper quantile : 85.906898 ns (97.5%)
Overhead used : 1.755171 ns
nil
It's unintuitive to me that get-in is more than twice as slow as threading through gets here, since get-in seems to be defined as the better abstraction for this sort of thing.
Does anyone have any insight into why this is the case (both technically and philosophically)?
Nested maps are very commonly used in Clojure programs. This is a good thing. But there are occasions where nested map operations such as assoc-in and get-in can be improved by unrolling. (get (get (get (get m :d) :c) :b) :a) is not the same thing as (get-in m [:d :c :b :a]) in terms of the bytecode produced. The bytecode of the latter results in worse execution time.
Note that Clojure has some pending patches http://dev.clojure.org/jira/browse/CLJ-1656 related to this.
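To illustrate the unrolling idea: get-in walks its key vector at run time (in clojure.core it is essentially a reduce of get over the keys), whereas a macro can expand literal keys into nested get calls at compile time. The macro name get-in* below is hypothetical and not part of Clojure; treat it as a minimal sketch and benchmark it on your own data before relying on it.

```clojure
;; Hypothetical helper, not part of clojure.core.
;; (get-in* m :a :b) expands to (-> m (get :a) (get :b)) at compile time,
;; so no key vector is traversed at run time.
(defmacro get-in* [m & ks]
  `(-> ~m ~@(for [k ks] `(get ~k))))

(def nested {:a {:b {:c 42}}})

(get-in* nested :a :b :c)   ; same result as (get-in nested [:a :b :c])
```

The trade-off is that the keys must be known where the macro is used; a key path built at run time still needs get-in.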
I would like to define a predicate that, taking as input some predicates
with corresponding inputs (they could be given as a lazy sequence of calls),
runs them in parallel and computes the logical or of the results,
in such a way that, the moment a predicate call terminates returning true,
the whole computation also terminates (returning true).
Apart from offering time optimization, this would also help avoiding
non-termination in some cases (some predicate calls may not terminate).
Actually, interpreting non-termination as a third undefined value,
this predicate simulates the or operation in Kleene's K3 logic
(the join in the initial centered Kleene algebra).
Something similar is presented here for the Haskell family.
Is there any (preferably simple) way to do this in Clojure?
EDIT: I decided to add some clarifications after reading the comments.
(a) First of all, what happens after the thread pool gets exhausted is of less importance. I think creating a thread pool large enough for our needs is a reasonable convention.
(b) The most crucial requirement is that the predicate calls start running in parallel and, once a predicate call terminates returning true, all the other threads running get interrupted. The intended behavior is that:
if there is a predicate call returning true: the parallel or returns true
else if there is a predicate call that does not terminate: the parallel or does not terminate
else: the parallel or returns false
In other words, it behaves like the join in the 3-element lattice given by false<undefined<true, with undefined representing non-termination.
(c) The parallel or should be able to take as input many predicates and many predicate-inputs (each one corresponding to a predicate). But it would be even better if it took as input a lazy sequence. Then, naming the parallel or pany (for "parallel any"), we could have calls like the following:
(pany (map (comp eval list) predicates inputs))
(pany (map (comp eval list) predicates (repeat input)))
(pany (map (comp eval list) (repeat predicate) inputs)) which is equivalent to (pany (map predicate (unchunk inputs)))
As a final remark, I think that it is quite natural to ask for things like pany, a dual pall or a mechanism for building such early-terminating parallel reductions to be easily implementable or even built-in in a parallelism-oriented language like Clojure.
I will define our predicates in terms of a reducing function. Practically, we could reimplement all of the Clojure iteration functions to support this parallel operation, but I'll just use reduce as an example.
I'll define a computation function. I'll just use the same one, but nothing stopping you from having many. The function is "true" if it accumulates 1000.
(defn computor [acc val]
(let [new (+' acc val)] (if (> new 1000) (reduced new) new)))
(reduce computor 0 (range))
;; =>
1035
(reduce computor 0 (range Long/MIN_VALUE 0))
;; =>
;; ...this is a proxy for a non-returning computation
;; wrap these up in a form suitable for application of reduction
(def predicates [[computor 0 (range)]
[computor 0 (range Long/MIN_VALUE 0)]])
Now let's get to the meat of this. I want to take a step in each computation, and if one of the computations completes, I want to return it. In actual fact one step at a time using pmap is very slow - the units of work are too small to be worth threading. Here I've changed things to do 1000 iterations of each unit of work before moving on. You'd probably tune this based on your workload and the cost of a step.
(defn p-or-reducer* [reductions]
(let [splits (map #(split-at 1000 %) reductions) ;; do at least 1000 iterations per cycle
complete (some #(if (empty? (second %)) (last (first %))) splits)]
(or complete (recur (map second splits)))))
I then wrap this in a driver.
(defn p-or [s]
(p-or-reducer* (map #(apply reductions %) s)))
(p-or predicates)
;; =>
1035
Where to insert the CPU parallelism? s/map/pmap/ in p-or-reducer* should do it. I suggest just parallelising the first operation, as this will drive the reducing sequences to compute.
(defn p-or-reducer* [reductions]
(let [splits (pmap #(split-at 1000 %) reductions) ;; do at least 1000 iterations per cycle
complete (some #(if (empty? (second %)) (last (first %))) splits)]
(or complete (recur (map second splits)))))
(def parallelism-tester (conj (vec (repeat 40000 [computor 0 (range Long/MIN_VALUE 0)]))
[computor 0 (range)]))
(p-or parallelism-tester) ;; terminates even though the first 40K predicates will not
It's extremely hard to define a performant generic version of this. Without knowing the cost per iteration, an efficient parallelism strategy is hard to derive: if one iteration takes 10s then we'd probably take a single step at a time; if it takes 100ns then we need to take many steps at a time.
Would you consider adopting core.async to handle the parallel tasks with async/go or async/thread, and the early return with async/alts! (or its blocking variant async/alts!!)?
For example, to turn the core or from serial into parallel, we can create a macro (I called it por) that wraps each input form in async/thread and then does a select-style async/alts!! over the resulting channels:
(require '[clojure.core.async :as async])

(defmacro por [& fns]
  `(let [[v# c#] (async/alts!!
                  [~@(for [f fns]
                       (list `async/thread f))])]
     v#))
(time
(por (do (println "running a") (Thread/sleep 30) :a)
(do (println "running b") (Thread/sleep 20) :b)
(do (println "running c") (Thread/sleep 10) :c)))
;; running a
;; running b
;; running c
;; "Elapsed time: 11.919169 msecs"
;; => :c
In comparison with the original or (which runs serially):
(time
(or (do (println "running a") (Thread/sleep 30) :a)
(do (println "running b") (Thread/sleep 20) :b)
(do (println "running c") (Thread/sleep 10) :c)))
;; running a
;; => :a
;; "Elapsed time: 31.642506 msecs"
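One caveat about por above: alts!! returns whichever thread finishes first, whether or not its value is truthy, so it is a "first to finish" rather than a strict parallel or. A minimal sketch of the semantics asked for in the question (true as soon as any call returns true, false only when all have returned false) can be built from futures and a promise. The name pany and its calling convention, a finite collection of zero-argument functions, are assumptions made for illustration. Cancellation relies on future-cancel, which only interrupts threads, so a computation that never checks for interruption will keep running in the background.

```clojure
;; Hypothetical sketch of `pany`; not a built-in.
;; Takes a finite collection of zero-argument functions (thunks).
(defn pany [thunks]
  (let [thunks  (vec thunks)
        result  (promise)
        pending (atom (count thunks))
        futs    (mapv (fn [t]
                        (future
                          ;; An exception (including interruption) counts as false here.
                          (if (try (t) (catch Exception _ false))
                            (deliver result true)
                            (when (zero? (swap! pending dec))
                              (deliver result false)))))
                      thunks)]
    (when (empty? thunks)
      (deliver result false))
    (let [r @result]                       ; blocks until one is true or all are falsey
      (doseq [f futs] (future-cancel f))   ; interrupt whatever is still running
      r)))

;; Returns true after roughly 10 ms instead of waiting for the slow call.
(pany [#(do (Thread/sleep 5000) false)
       #(do (Thread/sleep 10) true)])
```

If no call returns true and at least one never terminates, the deref blocks forever, which matches the "undefined" case in the question.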
I am trying to benchmark an expression using the Criterium library. The expression is
(vec (range 10000000))
To benchmark it I type
(criterium.core/bench (vec (range 10000000)))
and after a while I get
OutOfMemoryError GC overhead limit exceeded java.lang.Long.valueOf (Long.java:840)
As I have seen here, this means that the maximum size of the heap (1 GB) is not enough for the data to fit: the garbage collector tries to free space but is unable to do so. However, microbenchmarking the expression as below doesn't produce this error:
(dotimes [i 60] (time (vec (range 10000000))))
By the way, I set it to 60 times because I have seen here that the bench macro does 60 executions by default.
The question is why is this happening when using Criterium.
Edit: When starting a fresh repl the code below
{:max (.maxMemory (Runtime/getRuntime)), :total (.totalMemory (Runtime/getRuntime))}
outputs
{:max 922746880, :total 212860928}
After I run (dotimes [i 60] (time (vec (range 10000000)))) or (criterium.core/bench (vec (range 10000000)))
it outputs
{:max 922746880, :total 922746880}
I was able to reproduce the behavior by using this test:
;project.clj
:profiles {:test {:jvm-opts ["-Xms1024m" "-Xmx1024m"]}}
(:require [clojure.test :refer :all]
[criterium.core :as ben])
(deftest ^:focused ben-test
(is (ben/with-progress-reporting
(ben/bench (vec (range 10000000))))))
The stack trace looks like this:
Estimating sampling overhead
Warming up for JIT optimisations 10000000000 ...
compilation occurred before 377618 iterations
...
Estimating execution count ...
Sampling ...
Final GC...
Checking GC...
Finding outliers ...
Bootstrapping ...
Checking outlier significance
Warming up for JIT optimisations 10000000000 ...
compilation occurred before 1 iterations
criterium.core$execute_expr_core_timed_part$fn__40395.invoke (core.clj:370)
criterium.core$execute_expr_core_timed_part.invokeStatic (core.clj:366)
criterium.core$execute_expr_core_timed_part.invoke (core.clj:345)
criterium.core$execute_expr.invokeStatic (core.clj:378)
criterium.core$execute_expr.invoke (core.clj:374)
criterium.core$warmup_for_jit.invokeStatic (core.clj:428)
criterium.core$warmup_for_jit.invoke (core.clj:396)
criterium.core$run_benchmark.invokeStatic (core.clj:479)
criterium.core$run_benchmark.invoke (core.clj:470)
criterium.core$benchmark_STAR_.invokeStatic (core.clj:826)
criterium.core$benchmark_STAR_.invoke (core.clj:812)
We can see here that the error occurs in the JIT warm-up step. The interesting point is the function execute-expr-core-timed-part (core.clj:345). This function evaluates the expression (vec (range 10000000)) n times and saves the returned value each time to a so-called mutable place. My hypothesis is that the memory leak is here:
(time-body
 (loop [i (long (dec n))
        v (f)]
   (set-place mutable-place v) ; <== each result is stored in the mutable place
   (if (pos? i)
     (recur (unchecked-dec i) (f))
     v)))
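The hypothesis can be illustrated outside Criterium. The two helpers below are hypothetical and only mimic the shape of the two loops; they are not Criterium's code. In the first, the previous result is still reachable from the atom while the next one is being built, so roughly two results must fit in the heap at once. In the second, as with dotimes and time, each result becomes garbage as soon as it has been produced.

```clojure
(defn run-retaining
  "Calls f n times, storing each result in an atom
   (a stand-in for the mutable place)."
  [f n]
  (let [place (atom nil)]
    (dotimes [_ n]
      ;; (f) is evaluated while `place` still refers to the previous result.
      (reset! place (f)))
    @place))

(defn run-discarding
  "Calls f n times and drops each result immediately."
  [f n]
  (dotimes [_ n]
    (f)))

;; Small input so the sketch runs on any heap size.
(count (run-retaining #(vec (range 1000)) 5))
```

With an input as large as (vec (range 10000000)) and a 1 GB heap, that extra live copy may be enough to push the JVM over the GC overhead limit, though this remains a hypothesis rather than a confirmed diagnosis.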
(require '[clojure.core.reducers :as r])
(def data (into [] (take 10000000 (repeatedly #(rand-int 1000)))))
(defn frequencies [coll]
(reduce (fn [counts x]
(merge-with + counts {x 1}))
{} coll))
(defn pfrequencies [coll]
(r/reduce (fn [counts x]
(merge-with + counts {x 1}))
{} coll))
user=> (time (do (frequencies data) nil))
"Elapsed time: 29697.183 msecs"
user=> (time (do (pfrequencies data) nil))
"Elapsed time: 25273.794 msecs"
user=> (time (do (frequencies data) nil))
"Elapsed time: 25384.086 msecs"
user=> (time (do (pfrequencies data) nil))
"Elapsed time: 25778.502 msecs"
And who can show me an example with significant speedup?
I'm running on Mac OSX 10.7.5 with Java 1.7 on an Intel Core i7 (2 cores, http://ark.intel.com/products/54617).
You called it pfrequencies, which, along with your parallel-processing tag on the question, suggests you think that something is using multiple threads here. That is not the case, and neither is it the "main" goal of the reducers library.
The main thing reducers buy you is that you don't need to allocate many intermediate cons cells for your lazy sequences. Before reducers were introduced, frequencies would allocate 10000000 cons cells to create a sequential view of the vector for reduce to use. Now that reducers exist, vectors know how to reduce themselves without creating such temporary objects. But that feature has been backported into clojure.core/reduce, which behaves exactly like r/reduce (ignoring some minor features that are irrelevant here). So you are just benchmarking your function against an identical clone of itself.
The reducers library also includes the notion of a fold, which can do some work in parallel, and then later merge together the intermediate results. To use this, you need to provide more information than reduce needs: you must define how to start a "chunk" from nothing; your function must be associative; and you must specify how to combine chunks. A. Webb's answer demonstrates how to use fold correctly, to get work done on multiple threads.
However, you're unlikely to get any benefit from folding: in addition to the reason he notes (you give up transients, compared to clojure.core/frequencies), building a map is not easily parallelizable. If the bulk of the work in frequencies were addition (as it would be in something like (frequencies (repeat 1e6 1))), then fold would help; but most of the work is in managing the keys in the hashmap, which really has to be single-threaded eventually. You can build maps in parallel, but then you have to merge them together; since that combination step takes time proportional to the size of the chunk, rather than constant time, you gain little by doing the chunks on a separate thread anyway.
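For reference, the transient-based approach mentioned above can be sketched as follows. The name tfrequencies is hypothetical; the body follows the same pattern clojure.core/frequencies uses, a single-threaded reduce over a transient map.

```clojure
;; Hypothetical name; same pattern as clojure.core/frequencies.
(defn tfrequencies [coll]
  (persistent!
   (reduce (fn [counts x]
             ;; assoc! updates the transient map in place instead of
             ;; allocating a new persistent map per element.
             (assoc! counts x (inc (get counts x 0))))
           (transient {})
           coll)))

(tfrequencies [:a :b :a])   ; same result as (frequencies [:a :b :a])
```

A transient must only be modified by one thread at a time, which is one reason this style does not combine directly with a parallel fold.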
A fold version of your frequencies function would look something like
(defn pfrequencies [coll]
(r/fold
(fn combinef
([] {})
([x y] (merge-with + x y)))
(fn reducef
([counts x] (merge-with + counts {x 1})))
coll))
On 2 cores, it will likely be much slower than clojure.core/frequencies which uses transients. At least on 4 cores, it is faster (2x) than the first implementation, but still slower than clojure.core/frequencies.
You might also experiment with
(defn p2frequencies [coll]
(apply merge-with + (pmap clojure.core/frequencies (partition-all 512 coll))))
Some serious food for thought in the answers here. In this specific case maps should not be needed, since the result domain can be easily predicted and put in a vector where the index can be used. So, a naive implementation of a naive problem would be something like:
(defn freqs
[coll]
(reduce (fn [counts x] (assoc counts x (inc (get counts x))))
(vec (int-array 1000 0))
coll))
(defn rfreqs
[coll]
(r/fold
(fn combinef
([] (vec (int-array 1000 0)))
([& cols] (apply mapv + cols)))
(fn reducef
[counts x] (assoc counts x (inc (get counts x))))
coll))
Here the combinef would be a simple map addition over the 1000 columns of the resulting collections, which should be negligible.
This gives the reducer version a speedup of about 2-3x over the normal version, especially on bigger (10x-100x) datasets. The partition size of r/fold (the optional n parameter) can be tuned further; (* 16 1024) seemed optimal with a data size of 1E8 (which needs at least a 6GB JVM).
You could even use transients in both versions, but I didn't notice much improvement.
I know this version isn't appropriate for generic usage, but it might show the speed improvement without the hash management overhead.
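As a sketch of the partition-size tuning, r/fold accepts the chunk size as an optional first argument: (r/fold n combinef reducef coll). The function name rfreqs-n and the sample data are made up for illustration, and the code assumes, as above, that all values lie in the range 0 to 999.

```clojure
(require '[clojure.core.reducers :as r])

;; Hypothetical variant of rfreqs with an explicit chunk size n.
;; Assumes every element of coll is an integer in [0, 1000).
(defn rfreqs-n [n coll]
  (r/fold n
          (fn combinef
            ([] (vec (repeat 1000 0)))      ; empty counts for a fresh chunk
            ([a b] (mapv + a b)))           ; merge two chunks index by index
          (fn reducef [counts x]
            (assoc counts x (inc (get counts x))))
          coll))

(def sample (vec (repeatedly 100000 #(rand-int 1000))))

(reduce + (rfreqs-n (* 16 1024) sample))   ; total equals (count sample)
```

Note that r/fold only splits work across threads for foldable collections such as vectors; for a lazy sequence it falls back to a plain reduce, so the input is built with vec here.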
I am trying to figure out how to use clojure to efficiently apply a simple operation to a large sequence in parallel. I would like to be able to use the parallel solution to take advantage of the multiple cores on my machine to achieve some speedup.
I am attempting to use pmap in combination with partition-all to reduce the overhead of creating a future for every item in the input seq. Unfortunately, partition-all forces the complete evaluation of each partition seq. This causes an OutOfMemoryError on my machine.
(defn sum [vs]
(reduce + vs))
(def workers
(+ 2 (.. Runtime getRuntime availableProcessors)))
(let
[n 80000000
vs (range n)]
(time (sum vs))
(time (sum (pmap sum (partition-all (long (/ n workers)) vs)))))
How can I apply sum to a large input set, and beat the performance of the serial implementation?
Solution
Thanks to @Arthur Ulfeldt for pointing out the reducers library. Here is the solution using reducers. This code shows the expected performance improvement when running on a multi-core machine. (NOTE: I have changed vs to be a function to make the timing more accurate.)
(require '[clojure.core.reducers :as r])
(let
  [n 80000000
   vs #(range n)]
  (time (reduce + (vs)))
  (time (r/fold + (vs))))
When using pmap I have found that fairly large chunks are required to overcome the switching and future overhead; try a chunk size of 10,000 for an operation as fast as +. The potential gains are bounded by the overhead of generating the chunks, so the optimal value balances the available cores against the time required to make the chunks. In this case, with + as the workload, I was unable to make this faster than the single-threaded option.
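A minimal sketch of the chunked approach, with the chunk size passed in so it can be tuned. The function name chunked-psum is hypothetical.

```clojure
;; Hypothetical helper: sums coll by reducing fixed-size chunks in parallel.
(defn chunked-psum [chunk-size coll]
  (reduce +
          (pmap #(reduce + %)                     ; one future per chunk
                (partition-all chunk-size coll))))

(chunked-psum 10000 (range 1000000))   ; same result as (reduce + (range 1000000))
```

Because partition-all still has to walk the input sequentially to build each chunk, that chunking cost is the bound on the gains described above.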
If you're interested in doing this without pmap and potentially using fork/join check out the new(ish) reducers library
The OOM situation comes from the first test realizing the lazy sequence from (range n) which is then retained so it can be passed to the second sequence.
If I make the + function much slower by defining a slow+ function and use that, the difference between single thread, pmap over chunks, and reducers with fork/join becomes visible:
user> *clojure-version*
{:major 1, :minor 5, :incremental 0, :qualifier "RC15"}
(require '[clojure.core.reducers :as r])
(def workers
(+ 2 (.. Runtime getRuntime availableProcessors)))
(defn slow+
([] 0)
([x] x)
([x y] (reduce + (range 100000)) (+ x y)))
(defn run-test []
(let [n 8000]
(time (reduce slow+ (range n)))
(time (reduce slow+ (pmap #(reduce slow+ %) (partition-all (* workers 100) (range n)))))
(time (r/fold slow+ (vec (range n))))))
user> (run-test)
"Elapsed time: 28655.951241 msecs" ; one thread
"Elapsed time: 6975.488591 msecs" ; pmap over chunks
"Elapsed time: 8170.254426 msecs" ; using reducer
I have a Clojure program which is using some large maps (1000 - 2000 items) which are accessed 30 - 40 times a second and using Strings as the keys. I was wondering if there is a big performance difference if I used keywords or symbols as the keys instead?
Clojure map lookups are very fast, and do not particularly depend on the size of the map.
In fact, they are almost as fast as pure Java HashMaps, while enjoying many advantages over traditional HashMaps including being immutable and thread-safe.
If you are only doing 30-40 lookups a second then I guarantee you will never notice the difference regardless of what you use as keys. Worrying about this would count as premature optimisation.
Let's prove it: the following code does a million map lookups using strings as keys:
(def str-keys (map str (range 1000)))
(def m (zipmap str-keys (range 1000)))
(time (dotimes [i 1000] (doseq [k str-keys] (m k))))
=> "Elapsed time: 69.082224 msecs"
The following does a million map lookups using keywords as keys:
(def kw-keys (map #(keyword (str %)) (range 1000)))
(def m (zipmap kw-keys (range 1000)))
(time (dotimes [i 1000] (doseq [k kw-keys] (m k))))
=> "Elapsed time: 59.212864 msecs"
And for symbols:
(def sym-keys (map #(symbol (str %)) (range 1000)))
(def m (zipmap sym-keys (range 1000)))
(time (dotimes [i 1000] (doseq [k sym-keys] (m k))))
=> "Elapsed time: 61.590925 msecs"
In my tests, Symbols and Keywords were slightly faster than Strings, but still the difference could easily be explained by statistical error, and the average execution time per lookup was less than 100 nanoseconds for all cases.
So your 30-40 lookups are probably taking in the order of 0.001% of your CPU time (this even allows for the fact that in a real app, lookups will probably be a few times slower due to caching issues)
The likely reason for Keywords in particular being slightly faster is that they are interned (and can therefore use reference equality to check for equality). But as you can see the difference is sufficiently small that you really don't need to worry about it.
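What interning means in practice can be checked at the REPL: two keywords with the same name are the very same object, while two separately constructed strings (or symbols) are equal but distinct objects. This is only a small illustration of the identity point, not a benchmark.

```clojure
;; Keywords are interned: the same name always yields the same object.
(identical? (keyword "a") (keyword "a"))   ; true

;; Separately constructed strings are equal but not the same object.
(identical? (String. "a") (String. "a"))   ; false
(= (String. "a") (String. "a"))            ; true

;; Symbols compare equal by name but are not interned as objects.
(identical? (symbol "a") (symbol "a"))     ; false
(= (symbol "a") (symbol "a"))              ; true
```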