Question
How fast are small Clojure functions like assoc? I suspect that assoc operates in the 100 ns to 3 µs range, which makes it difficult to time.
Using time
user=> (def d {1 1, 2 2})
#'user/d
user=> (time (assoc d 3 3))
"Elapsed time: 0.04989 msecs"
{1 1, 2 2, 3 3}
There is clearly a lot of overhead there, so I don't trust this benchmark. Friends pointed me to Criterium, which handles a lot of the pain of benchmarking (multiple evaluations, JVM warm-up, GC; see How to benchmark functions in Clojure?).
Using Criterium
Sadly, on such a small benchmark even Criterium seems to fail:
user=> (use 'criterium.core)
nil
user=> (def d {1 1 2 2})
#'user/d
user=> (bench (assoc d 3 3))
WARNING: JVM argument TieredStopAtLevel=1 is active, and may lead to unexpected results as JIT C2 compiler may not be active. See http://www.slideshare.net/CharlesNutter/javaone-2012-jvm-jit-for-dummies.
WARNING: Final GC required 1.694448681330372 % of runtime
Evaluation count : 218293620 in 60 samples of 3638227 calls.
Execution time mean : -15.677491 ns
Execution time std-deviation : 6.093770 ns
Execution time lower quantile : -20.504699 ns ( 2.5%)
Execution time upper quantile : 1.430632 ns (97.5%)
Overhead used : 123.496848 ns
Just in case you missed it, this operation takes -15ns on average. I know that Clojure is pretty magical, but negative runtimes seem a bit too good to be true.
Repeat the Question
So really, how long does an assoc take? How can I benchmark micro operations in Clojure?
Why not just wrap it in a loop and renormalize?
On my hardware,
(bench (dotimes [_ 1000] (assoc d 3 3)))
yields a mean execution time of roughly 1000x that of
(bench (assoc d 3 3))
namely, about 100 µs in the first case, and 100 ns in the second. If your single assoc is "in the noise" for Criterium, you could try wrapping it in the same way and get pretty close to the "intrinsic" value. ((dotimes [_ 1000] 1) clocks in at 0.59 µs, so the extra overhead imposed by the loop itself is comparatively small.)
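In code, the renormalization might look like this (a sketch, assuming Criterium 0.4.x, where the benchmark macro returns a map whose :mean holds the point estimate in seconds; d is the map defined above):

(require '[criterium.core :as crit])

(let [result (crit/benchmark (dotimes [_ 1000] (assoc d 3 3)) {})
      mean-s (first (:mean result))]
  ;; divide the looped mean by the iteration count for a per-call figure
  (println "per-assoc estimate:" (/ (* mean-s 1e9) 1000) "ns"))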
Criterium tries to net out the overhead due to its own measurement. This can result in negative results for fast functions. See the Measurement Overhead Estimation section of the readme. Your overhead is suspiciously high. You might run (estimatated-overhead!) [sic] a few times to sample for a more accurate overhead figure.
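For example (the extra "at" really is in Criterium's function name, hence the [sic]):

(use 'criterium.core)
;; Force a fresh estimate of Criterium's own measurement overhead;
;; running this a few times should converge on a stable figure.
(estimatated-overhead!)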
Andy Fingerhut has maintained some Clojure benchmarks for a while.
Home - http://jafingerhut.github.io/clojure-benchmarks-results/Clojure-benchmarks.html
Results - http://jafingerhut.github.io/clojure-benchmarks-results/Clojure-expression-benchmark-graphs.html
ClojureScript:
Project - https://github.com/netguy204/cljs-bench
Results - http://www.50ply.com/cljs-bench/
Related
There are three kinds of parallel pipelines in core.async: pipeline, pipeline-blocking, and pipeline-async. The first argument of these functions is n, the ‘parallelism’ of the pipeline.
What is the meaning of this argument?
How does it influence the behaviour of each kind of pipeline? What is a good default value for parallelism? When and why would I increase or decrease that value? What if there are multiple such pipelines in a running program?
The source is informative:
(dotimes [_ n]
  (case type
    :blocking (thread
                (let [job (<!! jobs)]
                  (when (process job)
                    (recur))))
    :compute (go-loop []
               (let [job (<! jobs)]
                 (when (process job)
                   (recur))))
    :async (go-loop []
             (let [job (<! jobs)]
               (when (async job)
                 (recur))))))
n here is the parallelism argument provided; it thus controls the number of threads (in blocking mode) or go-loops (in compute or async mode).
What constitutes a "good default value" depends on your load profile, hardware resources, etc. If a job blocks on network access, you can potentially run more of them than you have CPUs; if it blocks on CPU, more threads than the cores you expect to be available is a waste; if your bottleneck is local disk I/O, then details well beyond the scope of this question (spinning platters or NVRAM? are different processes' access needs spread across platters?) become relevant.
In general: Use the same judgment, and experience, and tuning/measurement techniques you would apply to thread-pool sizing generally.
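As a minimal sketch (assuming a recent core.async; slow-lookup is a hypothetical stand-in for a blocking network call, hence pipeline-blocking):

(require '[clojure.core.async :as a])

(defn slow-lookup [id]
  (Thread/sleep 50) ; simulate blocking I/O
  {:id id})

(let [in  (a/chan)
      out (a/chan)]
  ;; n = 8: up to eight worker threads run slow-lookup concurrently
  (a/pipeline-blocking 8 out (map slow-lookup) in)
  (a/onto-chan! in (range 100))
  (a/<!! (a/into [] out)))

For an I/O-bound job like this, raising n past the CPU count can still increase throughput; for a CPU-bound job run through pipeline, it generally cannot.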
Traditional approach:
(defn make-people [first-names last-names]
  (loop [first-names first-names
         last-names last-names
         people []]
    (if (seq first-names)
      (recur (rest first-names)
             (rest last-names)
             (conj people {:first (first first-names) :last (first last-names)}))
      people)))
Shorter version:
(defn shorter-make-people [first-names last-names]
  (for [[first last] (partition 2 (interleave first-names last-names))]
    {:first first :last last}))
But I don't have an IDE at hand right now to test the performance with a large piece of data.
The questions are:
Doesn't for do the same thing as loop and recur in this example?
Does it apply to more general cases?
Any performance test results would be appreciated.
Reference source code in core.clj: for, loop
for creates a lazy sequence, i.e., it does not eagerly calculate the result as loop does. Instead, it calculates the result incrementally and on demand. This adds significant overhead, so it performs worse than loop (but still in linear time). For that price, lazy sequences offer other benefits; for example, when processing a lazy sequence you can opt not to hold the entire sequence in memory at once.
http://clojure.org/sequences
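A small illustration of the difference (for over a chunked source realizes elements a chunk at a time, typically 32):

(def people (for [n (range 3)]
              (do (println "building" n)
                  {:n n})))
;; nothing printed yet -- the body hasn't run
(first people)
;; prints "building 0", "building 1", "building 2" (the whole first chunk)
;; and returns {:n 0}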
Maybe I missed something, but why would it? There is no need for recursion whatsoever in the case of a for comprehension.
Regarding test results:
make-people
(bench (doall (make-people first-names last-names)))
Evaluation count : 1581540 in 60 samples of 26359 calls.
Execution time mean : 40.210018 µs
Execution time std-deviation : 1.838808 µs
Execution time lower quantile : 37.110371 µs ( 2.5%)
Execution time upper quantile : 44.515176 µs (97.5%)
Overhead used : 10.301128 ns
Found 2 outliers in 60 samples (3.3333 %)
low-severe 2 (3.3333 %)
Variance from outliers : 31.9497 % Variance is moderately inflated by outliers
shorter-make-people
(bench (doall (shorter-make-people first-names last-names)))
Evaluation count : 306180 in 60 samples of 5103 calls.
Execution time mean : 204.226064 µs
Execution time std-deviation : 5.726497 µs
Execution time lower quantile : 196.693866 µs ( 2.5%)
Execution time upper quantile : 213.226726 µs (97.5%)
Overhead used : 10.301128 ns
even-shorter-make-people
(defn even-shorter-make-people [first-names last-names]
  (map #(array-map :first %1 :last %2) first-names last-names))
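For example, with hypothetical sample data:

(even-shorter-make-people ["Ada" "Grace"] ["Lovelace" "Hopper"])
;;=> ({:first "Ada", :last "Lovelace"} {:first "Grace", :last "Hopper"})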
(bench (doall (even-shorter-make-people first-names last-names)))
Evaluation count : 1049880 in 60 samples of 17498 calls.
Execution time mean : 59.182048 µs
Execution time std-deviation : 2.338641 µs
Execution time lower quantile : 56.361840 µs ( 2.5%)
Execution time upper quantile : 64.056606 µs (97.5%)
Overhead used : 10.301128 ns
In my project I have a lot of coordinates to process, and in the 2D case I found that constructing (cons x y) is faster than (list x y) and (vector x y).
However, I have no idea how to extend cons to 3D or beyond, because there is nothing like a cons3. Is there any solution for a fast tuple in Common Lisp?
For illustration, I made the following tests:
* (time (loop repeat 10000 do (loop repeat 10000 collect (cons (random 10) (random 10)))))
Evaluation took:
7.729 seconds of real time
7.576000 seconds of total run time (7.564000 user, 0.012000 system)
[ Run times consist of 0.068 seconds GC time, and 7.508 seconds non-GC time. ]
98.02% CPU
22,671,859,477 processor cycles
3,200,156,768 bytes consed
NIL
* (time (loop repeat 10000 do (loop repeat 10000 collect (list (random 10) (random 10)))))
Evaluation took:
8.308 seconds of real time
8.096000 seconds of total run time (8.048000 user, 0.048000 system)
[ Run times consist of 0.212 seconds GC time, and 7.884 seconds non-GC time. ]
97.45% CPU
24,372,206,280 processor cycles
4,800,161,712 bytes consed
NIL
* (time (loop repeat 10000 do (loop repeat 10000 collect (vector (random 10) (random 10)))))
Evaluation took:
8.460 seconds of real time
8.172000 seconds of total run time (8.096000 user, 0.076000 system)
[ Run times consist of 0.260 seconds GC time, and 7.912 seconds non-GC time. ]
96.60% CPU
24,815,721,033 processor cycles
4,800,156,944 bytes consed
NIL
The general way to go about such data structures is to use defstruct. This is how you create data structures in Common Lisp. So, if you wanted to have a point in three-dimensional space, this is more or less what you would do:
(defstruct point-3d x y z)
Why is this better than arrays:
It names things properly.
It creates a bunch of useful stuff you'd be creating anyway, such as accessors, a function to test whether some data is of this type, a function to construct objects of this type, and some other goodies.
Typing is more elaborate than in arrays: you can specify the type of each slot separately.
It defines a specialized printing function that prints your data nicely.
Why is this better than lists:
You can always ask a struct to behave as a list by doing something like:
(defstruct (point-3d (:type list)) x y z)
All the same stuff as arrays.
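A quick illustration of the :type list variant from above: the constructor returns a plain list, and the accessors are positional.

(make-point-3d :x 1 :y 2 :z 3) ;=> (1 2 3)
(point-3d-y (make-point-3d :x 1 :y 2 :z 3)) ;=> 2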
Optimization issues:
You should probably explore other alternatives. The difference between creating an array and a cons cell of equivalent memory footprint is not worth optimizing. If your performance problem comes down to this particular operation, the task in general is probably unmanageable. But really, I think techniques like object pooling, memoization, and general caching should be tried first.
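For instance, a minimal memoization sketch in Common Lisp (assuming the arguments compare with EQUAL):

(defun memoize (fn)
  (let ((cache (make-hash-table :test #'equal)))
    (lambda (&rest args)
      (multiple-value-bind (value hit) (gethash args cache)
        (if hit
            value
            (setf (gethash args cache) (apply fn args)))))))

;; usage sketch: (setf (symbol-function 'f) (memoize #'f))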
Another point: you didn't tell the compiler to try to generate efficient code. You can tell the compiler to optimize for size, speed, or debuggability, and you should really measure performance only after you specify which kind of optimization you want.
I've written a quick test to see what the difference is:
(defstruct point-3d
  (x 0 :type fixnum)
  (y 0 :type fixnum)
  (z 0 :type fixnum))

(defun test-struct ()
  (declare (optimize speed))
  (loop :repeat 1000000 :do
    (make-point-3d :x (random 10) :y (random 10) :z (random 10))))
(time (test-struct))
;; Evaluation took:
;; 0.061 seconds of real time
;; 0.060000 seconds of total run time (0.060000 user, 0.000000 system)
;; 98.36% CPU
;; 133,042,429 processor cycles
;; 47,988,448 bytes consed
(defun test-array ()
  (declare (optimize speed))
  (loop :repeat 1000000
        :for point :of-type (simple-array fixnum (3))
          := (make-array 3 :element-type 'fixnum)
        :do (setf (aref point 0) (random 10)
                  (aref point 1) (random 10)
                  (aref point 2) (random 10))))
(time (test-array))
;; Evaluation took:
;; 0.048 seconds of real time
;; 0.047000 seconds of total run time (0.046000 user, 0.001000 system)
;; 97.92% CPU
;; 104,386,166 processor cycles
;; 48,018,992 bytes consed
The first version of my test came out biased because I forgot to run the GC before the first test, so it was disadvantaged by having to reclaim memory left over from the previous test. The numbers are now more precise, and they show that there is practically no difference between using structs and arrays.
So, again, as per my previous suggestion: use object pooling, memoization, or whatever other optimization technique you can think of. Optimizing here is a dead end.
Using declarations and inline functions, structs may be made faster than both arrays and lists:
(declaim (optimize (speed 3) (safety 0) (space 3)))
(print "Testing lists")
(terpri)
(time (loop repeat 10000
            do (loop repeat 10000
                     collect (list (random 1000.0)
                                   (random 1000.0)
                                   (random 1000.0)))))
(print "Testing arrays")
(terpri)
(declaim (inline make-pnt))
(defun make-pnt (&rest coords)
  (make-array 3 :element-type 'single-float :initial-contents coords))

(time (loop repeat 10000
            do (loop repeat 10000
                     collect (make-pnt (random 1000.0)
                                       (random 1000.0)
                                       (random 1000.0)))))
(print "Testing structs")
(terpri)
(declaim (inline new-point))
(defstruct (point
            (:type (vector single-float))
            (:constructor new-point (x y z)))
  (x 0.0 :type single-float)
  (y 0.0 :type single-float)
  (z 0.0 :type single-float))

(time (loop repeat 10000
            do (loop repeat 10000
                     collect (new-point (random 1000.0)
                                        (random 1000.0)
                                        (random 1000.0)))))
"Testing lists"
Evaluation took:
8.940 seconds of real time
8.924558 seconds of total run time (8.588537 user, 0.336021 system)
[ Run times consist of 1.109 seconds GC time, and 7.816 seconds non-GC time. ]
99.83% CPU
23,841,394,328 processor cycles
6,400,180,640 bytes consed
"Testing arrays"
Evaluation took:
8.154 seconds of real time
8.140509 seconds of total run time (7.948497 user, 0.192012 system)
[ Run times consist of 0.724 seconds GC time, and 7.417 seconds non-GC time. ]
99.84% CPU
21,743,874,280 processor cycles
4,800,178,240 bytes consed
"Testing structs"
Evaluation took:
7.631 seconds of real time
7.620476 seconds of total run time (7.432464 user, 0.188012 system)
[ Run times consist of 0.820 seconds GC time, and 6.801 seconds non-GC time. ]
99.86% CPU
20,350,103,048 processor cycles
4,800,179,360 bytes consed
I assume you are working with floating-point values, in which case (make-array 3 :element-type 'single-float) may be best. This way, you can expect the floats to be stored unboxed (in most implementations).
Be sure to sprinkle liberally with declarations like (declare (type (simple-array single-float (3)) p)), naming the variable being declared.
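For example, a sketch of a declared helper (point-norm is a hypothetical name):

(defun point-norm (p)
  (declare (type (simple-array single-float (3)) p)
           (optimize (speed 3) (safety 0)))
  (sqrt (+ (* (aref p 0) (aref p 0))
           (* (aref p 1) (aref p 1))
           (* (aref p 2) (aref p 2)))))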
I would like to find the quickest way of generating a non-lazy list of numbers in Clojure. Currently I am using the following:
(doall (take 5000000 (iterate inc 5000000000)))
to generate a non-lazy list of numbers between 5 billion and 5.005 billion. Are there any quicker ways of doing this? Thanks
(P.S. I am aware that using lists to store sequences of numbers is sub-optimal. However, I am using this as a benchmark for the Shen.java compiler.)
Actually, doall works great. The only problem with your example is the slow iterate function. You should use range instead:
(doall (range 5000000000 5005000000))
range is very fast. It's lazy, but it is optimized and generates numbers in chunks.
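You can observe the chunking directly (Clojure 1.7+, where range returns a chunked LongRange):

(chunked-seq? (seq (range 10)))
;;=> true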
Here are benchmark results for iterate and range obtained using Criterium:
user=> (quick-bench (doall (take 5000 (iterate inc 5000000))))
Evaluation count : 180 in 6 samples of 30 calls.
Execution time mean : 3.175749 ms
Execution time std-deviation : 1.179449 ms
Execution time lower quantile : 2.428681 ms ( 2.5%)
Execution time upper quantile : 4.735748 ms (97.5%)
Overhead used : 14.758153 ns
user=> (quick-bench (doall (range 5000000 5005000)))
Evaluation count : 672 in 6 samples of 112 calls.
Execution time mean : 1.253228 ms
Execution time std-deviation : 350.301594 µs
Execution time lower quantile : 845.026223 µs ( 2.5%)
Execution time upper quantile : 1.582950 ms (97.5%)
Overhead used : 14.758153 ns
As you can see, range is 2.5 times faster than iterate here.
On my PC it takes less than a second to generate all 5,000,000 numbers, but there are some tricks to make it work even faster.
For example, you may run the generation in the separate thread:
(let [numbers (range 5000000000 5005000000)]
(future (dorun numbers))
...)
It won't make generation faster, but you'll be able to use your sequence immediately, before it is fully realized.
doall is usually what you need. But what do you mean by "quickest"? Quickest in the context of performance, or in the context of coding?
I have a set of a small number of functions. Two functions perform a mathematical overlay operation (defined on http://docs.gimp.org/en/gimp-concepts-layer-modes.html, a little way down the page -- just search for "overlay" to find the math) in different ways. Now, this operation is something that Gimp does very quickly, in under a second, but I can't seem to optimize my code to get anything remotely similar.
(My application is a GUI application to help me see and compare various overlay combinations of a large number of files. The Gimp layer interface actually makes it rather difficult to just pick two images to overlay, then pick a different two, etc.)
Here is the code:
(set! *warn-on-reflection* true)

(defn to-8-bit [v]
  (short (* (/ v 65536) 256)))

(defn overlay-sample [base-p over-p]
  (to-8-bit
   (* (/ base-p 65536)
      (+ base-p
         (* (/ (* 2 over-p) 65536)
            (- 65536 base-p))))))

(defn overlay-map [^shorts base ^shorts over]
  (let [ovl (time (doall (map overlay-sample ^shorts base ^shorts over)))]
    (time (into-array Short/TYPE ovl))))

(defn overlay-array [base over]
  (let [ovl (time (amap base
                        i
                        r
                        (int (overlay-sample (aget r i)
                                             (aget over i)))))]
    ovl))
overlay-map and overlay-array do the same operation in different ways. I've written other versions of this operation, too. However, overlay-map is, by far, the fastest I have.
base and over, in both functions, are 16-bit integer arrays. The actual size of each is 1,276,800 samples (an 800 x 532 image with 3 samples per pixel). The end result should be a single array of the same, but scaled down to 8-bits.
My results from the (time) operation are pretty consistent. overlay-map runs the actual mathematical operation in about 16 or 17 seconds, then spends another 5 seconds copying the resulting sequence back into an integer array.
overlay-array takes about 111 seconds.
I've done a lot of reading about using arrays, type hints, etc., but my Java-array-only operation is amazingly slow! amap, aget, etc. were all supposed to be fast, but I have read the code, there is nothing that looks like a speed optimization there, and my results are consistent. I've even tried other computers and seen roughly the same difference.
Now, 16-17 seconds is actually rather painful at this data set size, but I've been caching the results so that I can easily switch back and forth. The same operation would take an atrociously long time if I increased the data set to anything like a full-size image (4770x3177). And there are other operations I want to do, too.
So, any suggestions on how to speed this up? What am I missing here?
UPDATE: I just made the entire project pertaining to this code public, so you can see the current version of the script I am using for speed tests at https://bitbucket.org/savannidgerinel/hdr-darkroom/src/62a42fcf6a4b/scripts/speed_test.clj . Feel free to download it and try it on your own gear, but obviously change the image file paths before running it.
Since your functions are purely mathematical, you might want to check out memoize:
(def fast-overlay (memoize overlay-sample))
(time (fast-overlay 1000 2000))
"Elapsed time: 1.279 msecs"
(time (fast-overlay 1000 2000))
"Elapsed time: 0.056 msecs"
What's happening here is that the arguments are being cached as the key and the return value as the value. When the value has already been computed, it is returned directly rather than recomputed.
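Applied to the arrays from the question, the memoized sampler could be wrapped in the same kind of amap loop (a sketch; overlay-memo is a hypothetical name, and the cache only pays off because pixel-value pairs repeat heavily within an image):

(def fast-overlay (memoize overlay-sample))

(defn overlay-memo [^shorts base ^shorts over]
  ;; amap clones base and fills each slot with the memoized result
  (amap base i _
        (short (fast-overlay (aget base i) (aget over i)))))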