Quickest way to generate a non-lazy list of numbers - clojure

I would like to find the quickest way of generating a non-lazy list of numbers in Clojure. Currently I am using the following:
(doall (take 5000000 (iterate inc 5000000000)))
to generate a non-lazy list of numbers between 5 billion and 5.005 billion. Are there any quicker ways of doing this? Thanks
(P.S. I am aware that using lists to store sequences of numbers is sub-optimal. However, I am using this as a benchmark for the Shen.java compiler.)

Actually, doall works great. The only problem with your example is the slow iterate function. You should use range instead:
(doall (range 5000000000 5005000000))
range is very fast. It's lazy, but it's optimized and generates numbers in chunks.
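You can observe the chunking at the REPL (a quick sketch; the 32-element chunk size is an implementation detail of Clojure's chunked sequences):
user=> (chunked-seq? (seq (range 5000000000 5005000000)))
true
user=> (first (map #(do (print \.) %) (range 100))) ; map realizes a whole chunk at a time
................................0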
Here are benchmark results for iterate and range, obtained using Criterium:
user=> (quick-bench (doall (take 5000 (iterate inc 5000000))))
Evaluation count : 180 in 6 samples of 30 calls.
Execution time mean : 3.175749 ms
Execution time std-deviation : 1.179449 ms
Execution time lower quantile : 2.428681 ms ( 2.5%)
Execution time upper quantile : 4.735748 ms (97.5%)
Overhead used : 14.758153 ns
user=> (quick-bench (doall (range 5000000 5005000)))
Evaluation count : 672 in 6 samples of 112 calls.
Execution time mean : 1.253228 ms
Execution time std-deviation : 350.301594 µs
Execution time lower quantile : 845.026223 µs ( 2.5%)
Execution time upper quantile : 1.582950 ms (97.5%)
Overhead used : 14.758153 ns
As you can see, range is 2.5 times faster than iterate here.
On my PC it takes less than a second to generate all 5000000 numbers, but there are some tricks to make it work even faster.
For example, you may run the generation in the separate thread:
(let [numbers (range 5000000000 5005000000)]
  (future (dorun numbers))
  ...)
It won't make the generation faster, but you'll be able to use your sequence immediately, before it's fully realized.
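For example (a sketch with a concrete consumer; the take runs while the future is still realizing the rest of the sequence):
(let [numbers (range 5000000000 5005000000)]
  (future (dorun numbers))    ; a background thread walks the seq, realizing it
  (println (take 3 numbers))  ; usable immediately; realized elements are cached
  numbers)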

doall is usually what you need. But what do you mean by "quickest"? Quickest in the context of performance, or in the context of coding?

Related

Passing pointers of variables in Julia

Is it possible to use something like pointers or references in Julia, as in C/C++ or C#? I'm wondering because it would be helpful to pass weighty objects by pointer/reference rather than by value. For example, using pointers, the memory for an object can be allocated once for the whole program, and then only the pointer needs to be passed around. As I imagine it, this would improve both memory usage and computing performance.
Simple C++ code showing what I'm trying to execute in Julia:
#include <iostream>

void some_function(int* variable) { // declare function
    *variable += 1;                 // add a value to the variable
}

int main() {
    int very_big_object = 1;          // define variable
    some_function(&very_big_object);  // pass pointer of very_big_object to some_function
    std::cout << very_big_object;     // write value of very_big_object to stdout
    return 0;                         // end of the program
}
Output:
2
A new object is created, and its pointer is then passed to some_function, which modifies the object through that pointer. Returning the new value is not necessary, because the program edited the original object, not a copy. After some_function executes, the value of the variable is printed to show how it has changed.
While you can manually obtain pointers for Julia objects, this is actually not necessary to obtain the performance and results you want. As the manual notes:
Julia function arguments follow a convention sometimes called "pass-by-sharing", which means that values are not copied when they are passed to functions. Function arguments themselves act as new variable bindings (new locations that can refer to values), but the values they refer to are identical to the passed values. Modifications to mutable values (such as Arrays) made within a function will be visible to the caller. This is the same behavior found in Scheme, most Lisps, Python, Ruby and Perl, among other dynamic languages.
Consequently, you can just pass the object to the function normally and then operate on it in-place. By convention in Julia, functions that mutate their input arguments (which includes any in-place function) have names that end in !. The only catch is that you have to be sure not to do anything within the body of the function that would trigger the object to be copied.
So, for example
function foo(A)
    B = A .+ 1 # This makes a copy, uh oh
    return B
end

function foo!(A)
    A .= A .+ 1 # Mutate A in-place
    return A # Technically we don't have to return anything at all, but there is also no performance cost to returning the mutated A
end
(Note in particular the . in .= in the second version; this is critical. If we had left it off, we would not have mutated A but merely rebound the name A (within the scope of the function) to the result of the right-hand side, which would make it entirely equivalent to the first version.)
If we then benchmark these, you can see the difference clearly:
julia> large_array = rand(10000,1000);
julia> using BenchmarkTools
julia> @benchmark foo(large_array)
BenchmarkTools.Trial: 144 samples with 1 evaluation.
Range (min … max): 23.315 ms … 78.293 ms ┊ GC (min … max): 0.00% … 43.25%
Time (median): 27.278 ms ┊ GC (median): 0.00%
Time (mean ± σ): 34.705 ms ± 12.316 ms ┊ GC (mean ± σ): 23.72% ± 23.04%
▁▄▃▄█ ▁
▇█████▇█▇▆▅▃▁▃▁▃▁▁▁▄▁▁▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▅▃▄▁▄▄▃▄▆▅▆▃▅▆▆▆▅▃ ▃
23.3 ms Histogram: frequency by time 55.3 ms <
Memory estimate: 76.29 MiB, allocs estimate: 2.
julia> @benchmark foo!(large_array)
BenchmarkTools.Trial: 729 samples with 1 evaluation.
Range (min … max): 5.209 ms … 9.655 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 6.529 ms ┊ GC (median): 0.00%
Time (mean ± σ): 6.845 ms ± 955.282 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁ ▄▇██▇▆ ▁ ▁
▂▂▃▂▄▃▇▄▅▅████████▇█▆██▄▅▅▆▆▄▅▆▇▅▆▄▃▃▄▃▄▃▃▃▂▂▃▃▃▃▃▃▂▃▃▃▂▂▂▃ ▃
5.21 ms Histogram: frequency by time 9.33 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
Note especially the difference in memory usage and allocations in addition to the ~4x time difference. At this point, pretty much all the time is being spent on the actual addition, so the only thing left to optimize if this were performance-critical code would be to make sure that the code is efficiently using all of your CPU's SIMD vector registers and instructions, which you can do with the LoopVectorization.jl package:
using LoopVectorization

function foo_simd!(A)
    @turbo A .= A .+ 1 # Mutate A in-place. Equivalently: @turbo @. A = A + 1
    return A
end
julia> @benchmark foo_simd!(large_array)
BenchmarkTools.Trial: 986 samples with 1 evaluation.
Range (min … max): 4.873 ms … 7.387 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 4.922 ms ┊ GC (median): 0.00%
Time (mean ± σ): 5.061 ms ± 330.307 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█▇▄▁▁ ▂▂▅▁
█████████████▆▆▇▆▅▅▄▅▅▅▆▅▅▅▅▅▅▆▅▅▄▁▄▅▄▅▄▁▄▁▄▁▅▄▅▁▁▄▅▁▄▁▁▄▁▄ █
4.87 ms Histogram: log(frequency) by time 6.7 ms <
Memory estimate: 0 bytes, allocs estimate: 0.
This buys us a bit more performance, but it looks like for this particular case Julia's normal compiler probably already found some of these SIMD optimizations.
Now, if for any reason you still want a literal pointer, you can always get this with Base.pointer, though note that this comes with some significant caveats and is generally not what you want.
help?> Base.pointer
pointer(array [, index])
Get the native address of an array or string, optionally at a given location index.
This function is "unsafe". Be careful to ensure that a Julia reference to array exists
as long as this pointer will be used. The GC.@preserve macro should be used to protect
the array argument from garbage collection within a given block of code.
Calling Ref(array[, index]) is generally preferable to this function as it guarantees
validity.
While Julia uses "pass-by-sharing", and you normally do not have to (and do not want to) use pointers, in some cases you actually do want to, and you can!
You construct pointers with Ref{Type} and then dereference them using [].
Consider the following function mutating its argument.
function mutate(v::Ref{Int})
    v[] = 999
end
This can be used in the following way:
julia> vv = Ref(33)
Base.RefValue{Int64}(33)
julia> mutate(vv);
julia> vv
Base.RefValue{Int64}(999)
julia> vv[]
999
For a longer discussion of passing by reference in Julia, have a look at this post: How to pass an object by reference and value in Julia?

Is 'for' using tail recursion like loop-recur pattern?

Traditional approach:
(defn make-people [first-names last-names]
  (loop [first-names first-names
         last-names last-names
         people []]
    (if (seq first-names)
      (recur (rest first-names)
             (rest last-names)
             (conj people {:first (first first-names) :last (first last-names)}))
      people)))
Shorter version:
(defn shorter-make-people [first-names last-names]
  (for [[first last] (partition 2 (interleave first-names last-names))]
    {:first first :last last}))
But I don't have an IDE at hand right now to test the performance on a large piece of data.
Questions are:
Doesn't 'for' do the same thing as 'loop' and 'recur' in this example?
Does it apply to more general cases?
Any performance testing results are welcome.
Reference source code in core.clj: for, loop
for creates a lazy sequence, i.e., it does not eagerly calculate the result as loop does. Instead it calculates the result incrementally, on demand. This adds significant overhead, so it performs worse than loop (but still in linear time). For that price, lazy sequences offer other benefits: for example, when processing a lazy sequence you can opt not to hold the entire sequence in memory at once.
http://clojure.org/sequences
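A minimal sketch of that difference:
;; for returns immediately, even over an infinite input,
;; because nothing is computed until it is demanded:
(def squares (for [x (range)] (* x x)))
(take 5 squares) ;=> (0 1 4 9 16)
;; an eager loop/recur over (range) would never terminate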
Maybe I missed something, but why would it? There is no need for recursion whatsoever in the case of a for comprehension.
Regarding test results:
make-people
(bench (doall (make-people first-names last-names)))
Evaluation count : 1581540 in 60 samples of 26359 calls.
Execution time mean : 40.210018 µs
Execution time std-deviation : 1.838808 µs
Execution time lower quantile : 37.110371 µs ( 2.5%)
Execution time upper quantile : 44.515176 µs (97.5%)
Overhead used : 10.301128 ns
Found 2 outliers in 60 samples (3.3333 %)
low-severe 2 (3.3333 %)
Variance from outliers : 31.9497 % Variance is moderately inflated by outliers
shorter-make-people
(bench (doall (shorter-make-people first-names last-names)))
Evaluation count : 306180 in 60 samples of 5103 calls.
Execution time mean : 204.226064 µs
Execution time std-deviation : 5.726497 µs
Execution time lower quantile : 196.693866 µs ( 2.5%)
Execution time upper quantile : 213.226726 µs (97.5%)
Overhead used : 10.301128 ns
even-shorter-make-people
(defn even-shorter-make-people [first-names last-names]
  (map #(array-map :first %1 :last %2) first-names last-names))
(bench (doall (even-shorter-make-people first-names last-names)))
Evaluation count : 1049880 in 60 samples of 17498 calls.
Execution time mean : 59.182048 µs
Execution time std-deviation : 2.338641 µs
Execution time lower quantile : 56.361840 µs ( 2.5%)
Execution time upper quantile : 64.056606 µs (97.5%)
Overhead used : 10.301128 ns

Microbenchmark Clojure functions

Question
How fast are small Clojure functions like assoc? I suspect that assoc operates in the 100 ns to 3 µs range, which makes it difficult to time.
Using time
user=> (def d {1 1, 2 2})
#'user/d
user=> (time (assoc d 3 3))
"Elapsed time: 0.04989 msecs"
{1 1, 2 2, 3 3}
There is clearly a lot of overhead there, so I don't trust this benchmark. Friends pointed me to Criterium, which handles a lot of the pain of benchmarking (multiple evaluations, warming up the JVM, GC; see How to benchmark functions in Clojure?).
Using Criterium
Sadly, on such a small benchmark even Criterium seems to fail
user=> (use 'criterium.core)
nil
user=> (def d {1 1 2 2})
#'user/d
user=> (bench (assoc d 3 3))
WARNING: JVM argument TieredStopAtLevel=1 is active, and may lead to unexpected results as JIT C2 compiler may not be active. See http://www.slideshare.net/CharlesNutter/javaone-2012-jvm-jit-for-dummies.
WARNING: Final GC required 1.694448681330372 % of runtime
Evaluation count : 218293620 in 60 samples of 3638227 calls.
Execution time mean : -15.677491 ns
Execution time std-deviation : 6.093770 ns
Execution time lower quantile : -20.504699 ns ( 2.5%)
Execution time upper quantile : 1.430632 ns (97.5%)
Overhead used : 123.496848 ns
Just in case you missed it, this operation takes -15ns on average. I know that Clojure is pretty magical, but negative runtimes seem a bit too good to be true.
Repeat the Question
So really, how long does an assoc take? How can I benchmark micro operations in Clojure?
Why not just wrap it in a loop and renormalize?
On my hardware,
(bench (dotimes [_ 1000] (assoc d 3 3)))
yields a mean execution time of roughly 1000x that of
(bench (assoc d 3 3))
namely, about 100 µs in the first case, and 100 ns in the second. If your single assoc is "in the noise" for Criterium, you could try wrapping it in the same way and get pretty close to the "intrinsic" value. ((dotimes [_ 1000] 1) clocks in at 0.59 µs, so the extra overhead imposed by the loop itself is comparatively small.)
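Concretely, the renormalization is just a division by the iteration count, minus the empty loop's own cost (using the numbers above):
;; per-call time ~ (looped mean - empty-loop mean) / iterations, in µs:
(/ (- 100.0 0.59) 1000) ;=> ~0.0994 µs, i.e. roughly 100 ns per assoc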
Criterium tries to net out the overhead of its own measurement, which can produce negative results for fast functions. See the Measurement Overhead Estimation section of the readme. Your overhead is suspiciously high; you might run (estimatated-overhead!) [sic] a few times to sample a more accurate overhead figure.
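For example (a sketch; the misspelled name really is what criterium exports):
(use 'criterium.core)
(estimatated-overhead!) ; re-runs the overhead estimation and caches the result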
Andy Fingerhut has maintained some Clojure benchmark suites for a while:
Home - http://jafingerhut.github.io/clojure-benchmarks-results/Clojure-benchmarks.html
Results - http://jafingerhut.github.io/clojure-benchmarks-results/Clojure-expression-benchmark-graphs.html
ClojureScript:
Project - https://github.com/netguy204/cljs-bench
Results - http://www.50ply.com/cljs-bench/

Is there a 'tuple' equivalent thing in common-lisp?

In my project I have a lot of coordinates to process, and in the 2D case I found that constructing (cons x y) is faster than (list x y) or (vector x y).
However, I have no idea how to extend cons to 3D or beyond, because there is no such thing as cons3. Is there any solution for a fast tuple in Common Lisp?
For illustration, I made the following tests:
* (time (loop repeat 10000 do (loop repeat 10000 collect (cons (random 10) (random 10)))))
Evaluation took:
7.729 seconds of real time
7.576000 seconds of total run time (7.564000 user, 0.012000 system)
[ Run times consist of 0.068 seconds GC time, and 7.508 seconds non-GC time. ]
98.02% CPU
22,671,859,477 processor cycles
3,200,156,768 bytes consed
NIL
* (time (loop repeat 10000 do (loop repeat 10000 collect (list (random 10) (random 10)))))
Evaluation took:
8.308 seconds of real time
8.096000 seconds of total run time (8.048000 user, 0.048000 system)
[ Run times consist of 0.212 seconds GC time, and 7.884 seconds non-GC time. ]
97.45% CPU
24,372,206,280 processor cycles
4,800,161,712 bytes consed
NIL
* (time (loop repeat 10000 do (loop repeat 10000 collect (vector (random 10) (random 10)))))
Evaluation took:
8.460 seconds of real time
8.172000 seconds of total run time (8.096000 user, 0.076000 system)
[ Run times consist of 0.260 seconds GC time, and 7.912 seconds non-GC time. ]
96.60% CPU
24,815,721,033 processor cycles
4,800,156,944 bytes consed
NIL
The general way to go about such data structures is to use defstruct. This is how you create data structures in Common Lisp. So, if you wanted to have a point in three-dimensional space, this is more or less what you would do:
(defstruct point-3d x y z)
Why is this better than arrays:
It names things properly.
It creates a bunch of useful stuff you'd be writing anyway, such as accessors, a predicate to test whether some data is of this type, a constructor function, and some other goodies.
Typing is more elaborate than in arrays: you can specify the type of each slot separately.
A specialized printing function that can print your data nicely.
Why is this better than lists:
You can always ask a struct to behave as a list by doing something like:
(defstruct (point-3d (:type list)) x y z)
All the same stuff as arrays.
Optimization issues:
You should probably explore other alternatives. The difference between creating an array and a cons cell of equivalent memory footprint is not worth optimizing. If this particular operation is really your bottleneck, you should consider the task unmanageable in general. But really, I think that techniques like object pooling, memoization and general caching should be tried first.
Another point: you didn't tell the compiler to generate efficient code. You can tell the compiler to optimize for size, speed or debugging. You should measure the performance again after you specify what kind of optimization you are after.
I've written a quick test to see what's the difference:
(defstruct point-3d
  (x 0 :type fixnum)
  (y 0 :type fixnum)
  (z 0 :type fixnum))

(defun test-struct ()
  (declare (optimize speed))
  (loop :repeat 1000000 :do
    (make-point-3d :x (random 10) :y (random 10) :z (random 10))))
(time (test-struct))
;; Evaluation took:
;; 0.061 seconds of real time
;; 0.060000 seconds of total run time (0.060000 user, 0.000000 system)
;; 98.36% CPU
;; 133,042,429 processor cycles
;; 47,988,448 bytes consed
(defun test-array ()
  (declare (optimize speed))
  (loop :repeat 1000000
        :for point :of-type (simple-array fixnum (3))
          := (make-array 3 :element-type 'fixnum)
        :do (setf (aref point 0) (random 10)
                  (aref point 1) (random 10)
                  (aref point 2) (random 10))))
(time (test-array))
;; Evaluation took:
;; 0.048 seconds of real time
;; 0.047000 seconds of total run time (0.046000 user, 0.001000 system)
;; 97.92% CPU
;; 104,386,166 processor cycles
;; 48,018,992 bytes consed
The first version of my test came out biased because I forgot to run the GC before the first test, so it was disadvantaged by having to reclaim memory left over from the previous test. Now the numbers are more precise, and they also show that there is practically no difference between using structs and arrays.
So, again, as per my previous suggestion: use object pooling, memoization, whatever other optimization technique you may think of. Optimizing here is a dead end.
Using declarations and inline functions, structs may be made faster than both arrays and lists:
(declaim (optimize (speed 3) (safety 0) (space 3)))

(print "Testing lists")
(terpri)
(time (loop repeat 10000 do
  (loop repeat 10000
        collect (list (random 1000.0)
                      (random 1000.0)
                      (random 1000.0)))))

(print "Testing arrays")
(terpri)
(declaim (inline make-pnt))
(defun make-pnt (&rest coords)
  (make-array 3 :element-type 'single-float :initial-contents coords))
(time (loop repeat 10000 do
  (loop repeat 10000
        collect (make-pnt (random 1000.0)
                          (random 1000.0)
                          (random 1000.0)))))

(print "Testing structs")
(terpri)
(declaim (inline new-point))
(defstruct (point
            (:type (vector single-float))
            (:constructor new-point (x y z)))
  (x 0.0 :type single-float)
  (y 0.0 :type single-float)
  (z 0.0 :type single-float))
(time (loop repeat 10000 do
  (loop repeat 10000
        collect (new-point (random 1000.0)
                           (random 1000.0)
                           (random 1000.0)))))
"Testing lists"
Evaluation took:
8.940 seconds of real time
8.924558 seconds of total run time (8.588537 user, 0.336021 system)
[ Run times consist of 1.109 seconds GC time, and 7.816 seconds non-GC time. ]
99.83% CPU
23,841,394,328 processor cycles
6,400,180,640 bytes consed
"Testing arrays"
Evaluation took:
8.154 seconds of real time
8.140509 seconds of total run time (7.948497 user, 0.192012 system)
[ Run times consist of 0.724 seconds GC time, and 7.417 seconds non-GC time. ]
99.84% CPU
21,743,874,280 processor cycles
4,800,178,240 bytes consed
"Testing structs"
Evaluation took:
7.631 seconds of real time
7.620476 seconds of total run time (7.432464 user, 0.188012 system)
[ Run times consist of 0.820 seconds GC time, and 6.801 seconds non-GC time. ]
99.86% CPU
20,350,103,048 processor cycles
4,800,179,360 bytes consed
I assume you are working with floating-point values, in which case (make-array 3 :element-type 'single-float) may be best. This way, you can expect the floats to be stored unboxed (in most implementations).
Be sure to sprinkle liberally with (declare (type (simple-array single-float (3)))).

How to improve performance on a function that operates on two arrays in clojure

I have a small set of functions. Two of them perform a mathematical overlay operation (defined at http://docs.gimp.org/en/gimp-concepts-layer-modes.html, a little way down -- just search for "overlay" to find the math) in different ways. Now, this is an operation that Gimp does very quickly, in under a second, but I can't seem to optimize my code to get anything remotely close to that time.
(My application is a GUI application to help me see and compare various overlay combinations of a large number of files. The Gimp layer interface actually makes it rather difficult to just pick two images to overlay, then pick a different two, etc.)
Here is the code:
(set! *warn-on-reflection* true)

(defn to-8-bit [v]
  (short (* (/ v 65536) 256)))

(defn overlay-sample [base-p over-p]
  (to-8-bit
    (* (/ base-p 65536)
       (+ base-p
          (* (/ (* 2 over-p) 65536)
             (- 65536 base-p))))))

(defn overlay-map [^shorts base ^shorts over]
  (let [ovl (time (doall (map overlay-sample ^shorts base ^shorts over)))]
    (time (into-array Short/TYPE ovl))))

(defn overlay-array [base over]
  (let [ovl (time (amap base i r
                        (int (overlay-sample (aget r i)
                                             (aget over i)))))]
    ovl))
overlay-map and overlay-array do the same operation in different ways. I've written other versions of this operation, too. However, overlay-map is, by far, the fastest I have.
base and over, in both functions, are 16-bit integer arrays. The actual size of each is 1,276,800 samples (an 800 x 532 image with 3 samples per pixel). The end result should be a single array of the same, but scaled down to 8-bits.
My results from the (time) operation are pretty consistent. overlay-map runs the actual mathematical operation in about 16 or 17 seconds, then spends another 5 seconds copying the resulting sequence back into an integer array.
overlay-array takes about 111 seconds.
I've done a lot of reading about using arrays, type hints, etc., but my Java-array-only operation is amazingly slow! amap, aget, etc. were all supposed to be fast, but I have read the code and there is nothing that looks like a speed optimization there, and my results are consistent. I've even tried other computers and seen roughly the same difference.
Now, 16-17 seconds is actually rather painful for this data set, but I've been caching the results so that I can easily switch back and forth. The same operation would take an atrociously long time if I increased the data set to anything like a full-size image (4770x3177). And there are other operations I want to be doing, too.
So, any suggestions on how to speed this up? What am I missing here?
UPDATE: I just made the entire project pertaining to this code public, so you can see the current version of the entire script I am using for speed tests at https://bitbucket.org/savannidgerinel/hdr-darkroom/src/62a42fcf6a4b/scripts/speed_test.clj . Feel free to download it and try it on your own gear, but obviously change the image file paths before running it.
Since your functions are purely mathematical, you might want to check out memoize
(def fast-overlay (memoize overlay-sample))
(time (fast-overlay 1000 2000))
"Elapsed time: 1.279 msecs"
(time (fast-overlay 1000 2000))
"Elapsed time: 0.056 msecs"
What's happening here is that the arguments are cached as the key and the return value as the value. When the value has already been computed, the cached value is returned rather than the function being executed again.
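As a sketch of how this could slot into the code above (overlay-map-memo is a hypothetical name; whether the cache pays off depends on how many distinct (base-p, over-p) pairs actually occur in your images):
(defn overlay-map-memo [^shorts base ^shorts over]
  ;; same shape as overlay-map, but each distinct sample pair is computed only once
  (into-array Short/TYPE (map fast-overlay base over)))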