Binary search in clojure (implementation / performance) - clojure

I wrote a binary search function as part of a larger program, but it seems to be slower than it should be and profiling shows a lot of calls to methods in clojure.lang.Numbers.
My understanding is that Clojure can use primitives when it can determine that it can do so. The calls to the methods in clojure.lang.Numbers seems to indicate that it's not using primitives here.
If I coerce the loop variables to ints, it properly complains that the recur arguments are not primitive. If i coerce those too, the code works again but again it's slow. My only guess is that (quot (+ low-idx high-idx) 2) is not producing a primitive but I'm not sure where to go from here.
This is my first program in Clojure so feel free to let me know if there are more cleaner/functional/Clojure ways to do something.
(defn binary-search
[coll coll-size target]
(let [cnt (dec coll-size)]
(loop [low-idx 0 high-idx cnt]
(if (> low-idx high-idx)
nil
(let [mid-idx (quot (+ low-idx high-idx) 2) mid-val (coll mid-idx)]
(cond
(= mid-val target) mid-idx
(< mid-val target) (recur (inc mid-idx) high-idx)
(> mid-val target) (recur low-idx (dec mid-idx))
))))))
(defn binary-search-perf-test
[test-size]
(do
(let [test-set (vec (range 1 (inc test-size))) test-set-size (count test-set)]
(time (count (map #(binary-search2 test-set test-set-size %) test-set)))
)))

First of all, you can use the binary search implementation provided by java.util.Collections:
(java.util.Collections/binarySearch [0 1 2 3] 2 compare)
; => 2
If you skip the compare, the search will be faster still, unless the collection includes bigints, in which case it'll break.
As for your pure Clojure implementation, you can hint coll-size with ^long in the parameter vector -- or maybe just ask for the vector's size at the beginning of the function's body (that's a very fast, constant time operation), replace the (quot ... 2) call with (bit-shift-right ... 1) and use unchecked math for the index calculations. With some additional tweaks a binary search could be written as follows:
(defn binary-search
"Finds earliest occurrence of x in xs (a vector) using binary search."
([xs x]
(loop [l 0 h (unchecked-dec (count xs))]
(if (<= h (inc l))
(cond
(== x (xs l)) l
(== x (xs h)) h
:else nil)
(let [m (unchecked-add l (bit-shift-right (unchecked-subtract h l) 1))]
(if (< (xs m) x)
(recur (unchecked-inc m) h)
(recur l m)))))))
This is still noticeably slower than the Java variant:
(defn java-binsearch [xs x]
(java.util.Collections/binarySearch xs x compare))
binary-search as defined above seems to take about 25% more time than this java-binsearch.

in Clojure 1.2.x you can only coerce local variables and they can't cross functions calls.
starting in Clojure 1.3.0 Clojure can use primative numbers across function calls but not through Higher Order Functions such as map.
if you are using clojure 1.3.0+ then you should be able to accomplish this using type hints
as with any clojure optimization problem the first step is to turn on (set! *warn-on-reflection* true) then add type hints until it no longer complains.
user=> (set! *warn-on-reflection* true)
true
user=> (defn binary-search
[coll coll-size target]
(let [cnt (dec coll-size)]
(loop [low-idx 0 high-idx cnt]
(if (> low-idx high-idx)
nil
(let [mid-idx (quot (+ low-idx high-idx) 2) mid-val (coll mid-idx)]
(cond
(= mid-val target) mid-idx
(< mid-val target) (recur (inc mid-idx) high-idx)
(> mid-val target) (recur low-idx (dec mid-idx))
))))))
NO_SOURCE_FILE:23 recur arg for primitive local: low_idx is not matching primitive,
had: Object, needed: long
Auto-boxing loop arg: low-idx
#'user/binary-search
user=>
to remove this you can type hint the coll-size argument
(defn binary-search
[coll ^long coll-size target]
(let [cnt (dec coll-size)]
(loop [low-idx 0 high-idx cnt]
(if (> low-idx high-idx)
nil
(let [mid-idx (quot (+ low-idx high-idx) 2) mid-val (coll mid-idx)]
(cond
(= mid-val target) mid-idx
(< mid-val target) (recur (inc mid-idx) high-idx)
(> mid-val target) (recur low-idx (dec mid-idx))
))))))
it is understandably difficult to connect the auto-boxing on line 10 to the coll-size parameter because it goes through cnt then high-idx then mid-ixd and so on, so I generally approach these problems by type-hinting everything until I find the one that makes the warnings go away, then remove hints so long as they stay gone

Related

Writing a lazy-as-possible unfoldr-like function to generate arbitrary factorizations

problem formulation
Informally speaking, I want to write a function which, taking as input a function that generates binary factorizations and an element (usually neutral), creates an arbitrary length factorization generator. To be more specific, let us first define the function nfoldr in Clojure.
(defn nfoldr [f e]
(fn rec [n]
(fn [s]
(if (zero? n)
(if (empty? s) e)
(if (seq s)
(if-some [x ((rec (dec n)) (rest s))]
(f (list (first s) x))))))))
Here nil is used with the meaning "undefined output, input not in function's domain". Additionally, let us view the inverse relation of a function f as a set-valued function defining inv(f)(y) = {x | f(x) = y}.
I want to define a function nunfoldr such that inv(nfoldr(f , e)(n)) = nunfoldr(inv(f) , e)(n) when for every element y inv(f)(y) is finite, for each binary function f, element e and natural number n.
Moreover, I want the factorizations to be generated as lazily as possible, in a 2-dimensional sense of laziness. My goal is that, when getting some part of a factorization for the first time, there does not happen (much) computation needed for next parts or next factorizations. Similarly, when getting one factorization for the first time, there does not happen computation needed for next ones, whereas all the previous ones get in effect fully realized.
In an alternative formulation we can use the following longer version of nfoldr, which is equivalent to the shorter one when e is a neutral element.
(defn nfoldr [f e]
(fn [n]
(fn [s]
(if (zero? n)
(if (empty? s) e)
((fn rec [n]
(fn [s]
(if (= 1 n)
(if (and (seq s) (empty? (rest s))) (first s))
(if (seq s)
(if-some [x ((rec (dec n)) (rest s))]
(f (list (first s) x)))))))
n)))))
a special case
This problem is a generalization of the problem of generating partitions described in that question. Let us see how the old problem can be reduced to the current one. We have for every natural number n:
npt(n) = inv(nconcat(n)) = inv(nfoldr(concat2 , ())(n)) = nunfoldr(inv(concat2) , ())(n) = nunfoldr(pt2 , ())(n)
where:
npt(n) generates n-ary partitions
nconcat(n) computes n-ary concatenation
concat2 computes binary concatenation
pt2 generates binary partitions
So the following definitions give a solution to that problem.
(defn generate [step start]
(fn [x] (take-while some? (iterate step (start x)))))
(defn pt2-step [[x y]]
(if (seq y) (list (concat x (list (first y))) (rest y))))
(def pt2-start (partial list ()))
(def pt2 (generate pt2-step pt2-start))
(def npt (nunfoldr pt2 ()))
I will summarize my story of solving this problem, using the old one to create example runs, and conclude with some observations and proposals for extension.
solution 0
At first, I refined/generalized the approach I took for solving the old problem. Here I write my own versions of concat and map mainly for a better presentation and, in the case of concat, for some added laziness. Of course we can use Clojure's versions or mapcat instead.
(defn fproduct [f]
(fn [s]
(lazy-seq
(if (and (seq f) (seq s))
(cons
((first f) (first s))
((fproduct (rest f)) (rest s)))))))
(defn concat' [s]
(lazy-seq
(if (seq s)
(if-let [x (seq (first s))]
(cons (first x) (concat' (cons (rest x) (rest s))))
(concat' (rest s))))))
(defn map' [f]
(fn rec [s]
(lazy-seq
(if (seq s)
(cons (f (first s)) (rec (rest s)))))))
(defn nunfoldr [f e]
(fn rec [n]
(fn [x]
(if (zero? n)
(if (= e x) (list ()) ())
((comp
concat'
(map' (comp
(partial apply map)
(fproduct (list
(partial partial cons)
(rec (dec n))))))
f)
x)))))
In an attempt to get inner laziness we could replace (partial partial cons) with something like (comp (partial partial concat) list). Although this way we get inner LazySeqs, we do not gain any effective laziness because, before consing, most of the computation required for fully realizing the rest part takes place, something that seems unavoidable within this general approach. Based on the longer version of nfoldr, we can also define the following faster version.
(defn nunfoldr [f e]
(fn [n]
(fn [x]
(if (zero? n)
(if (= e x) (list ()) ())
(((fn rec [n]
(fn [x] (println \< x \>)
(if (= 1 n)
(list (list x))
((comp
concat'
(map' (comp
(partial apply map)
(fproduct (list
(partial partial cons)
(rec (dec n))))))
f)
x))))
n)
x)))))
Here I added a println call inside the main recursive function to get some visualization of eagerness. So let us demonstrate the outer laziness and inner eagerness.
user=> (first ((npt 5) (range 3)))
< (0 1 2) >
< (0 1 2) >
< (0 1 2) >
< (0 1 2) >
< (0 1 2) >
(() () () () (0 1 2))
user=> (ffirst ((npt 5) (range 3)))
< (0 1 2) >
< (0 1 2) >
< (0 1 2) >
< (0 1 2) >
< (0 1 2) >
()
solution 1
Then I thought of a more promising approach, using the function:
(defn transpose [s]
(lazy-seq
(if (every? seq s)
(cons
(map first s)
(transpose (map rest s))))))
To get the new solution we replace the previous argument in the map' call with:
(comp
(partial map (partial apply cons))
transpose
(fproduct (list
repeat
(rec (dec n)))))
Trying to get inner laziness we could replace (partial apply cons) with #(cons (first %) (lazy-seq (second %))) but this is not enough. The problem lies in the (every? seq s) test inside transpose, where checking a lazy sequence of factorizations for emptiness (as a stopping condition) results in realizing it.
solution 2
A first way to tackle the previous problem that came to my mind was to use some additional knowledge about the number of n-ary factorizations of an element. This way we can repeat a certain number of times and use only this sequence for the stopping condition of transpose. So we will replace the test inside transpose with (seq (first s)), add an input count to nunfoldr and replace the argument in the map' call with:
(comp
(partial map #(cons (first %) (lazy-seq (second %))))
transpose
(fproduct (list
(partial apply repeat)
(rec (dec n))))
(fn [[x y]] (list (list ((count (dec n)) y) x) y)))
Let us turn to the problem of partitions and define:
(defn npt-count [n]
(comp
(partial apply *)
#(map % (range 1 n))
(partial comp inc)
(partial partial /)
count))
(def npt (nunfoldr pt2 () npt-count))
Now we can demonstrate outer and inner laziness.
user=> (first ((npt 5) (range 3)))
< (0 1 2) >
(< (0 1 2) >
() < (0 1 2) >
() < (0 1 2) >
() < (0 1 2) >
() (0 1 2))
user=> (ffirst ((npt 5) (range 3)))
< (0 1 2) >
()
However, the dependence on additional knowledge and the extra computational cost make this solution unacceptable.
solution 3
Finally, I thought that in some crucial places I should use a kind of lazy sequences "with a non-lazy end", in order to be able to check for emptiness without realizing. An empty such sequence is just a non-lazy empty list and overall they behave somewhat like the lazy-conss of the early days of Clojure. Using the definitions given below we can reach an acceptable solution, which works under the assumption that always at least one of the concat'ed sequences (when there is one) is non-empty, something that holds in particular when every element has at least one binary factorization and we are using the longer version of nunfoldr.
(def lazy? (partial instance? clojure.lang.IPending))
(defn empty-eager? [x] (and (not (lazy? x)) (empty? x)))
(defn transpose [s]
(lazy-seq
(if-not (some empty-eager? s)
(cons
(map first s)
(transpose (map rest s))))))
(defn concat' [s]
(if-not (empty-eager? s)
(lazy-seq
(if-let [x (seq (first s))]
(cons (first x) (concat' (cons (rest x) (rest s))))
(concat' (rest s))))
()))
(defn map' [f]
(fn rec [s]
(if-not (empty-eager? s)
(lazy-seq (cons (f (first s)) (rec (rest s))))
())))
Note that in this approach the input function f should produce lazy sequences of the new kind and the resulting n-ary factorizer will also produce such sequences. To take care of the new input requirement, for the problem of partitions we define:
(defn pt2 [s]
(lazy-seq
(let [start (list () s)]
(cons
start
((fn rec [[x y]]
(if (seq y)
(lazy-seq
(let [step (list (concat x (list (first y))) (rest y))]
(cons step (rec step))))
()))
start)))))
Once again, let us demonstrate outer and inner laziness.
user=> (first ((npt 5) (range 3)))
< (0 1 2) >
< (0 1 2) >
(< (0 1 2) >
() < (0 1 2) >
() < (0 1 2) >
() () (0 1 2))
user=> (ffirst ((npt 5) (range 3)))
< (0 1 2) >
< (0 1 2) >
()
To make the input and output use standard lazy sequences (sacrificing a bit of laziness), we can add:
(defn lazy-end->eager-end [s]
(if (seq s)
(lazy-seq (cons (first s) (lazy-end->eager-end (rest s))))
()))
(defn eager-end->lazy-end [s]
(lazy-seq
(if-not (empty-eager? s)
(cons (first s) (eager-end->lazy-end (rest s))))))
(def nunfoldr
(comp
(partial comp (partial comp eager-end->lazy-end))
(partial apply nunfoldr)
(fproduct (list
(partial comp lazy-end->eager-end)
identity))
list))
observations and extensions
While creating solution 3, I observed that the old mechanism for lazy sequences in Clojure might not be necessarily inferior to the current one. With the transition, we gained the ability to create lazy sequences without any substantial computation taking place but lost the ability to check for emptiness without doing the computation needed to get one more element. Because both of these abilities can be important in some cases, it would be nice if a new mechanism was introduced, which would combine the advantages of the previous ones. Such a mechanism could use again an outer LazySeq thunk, which when forced would return an empty list or a Cons or another LazySeq or a new LazyCons thunk. This new thunk when forced would return a Cons or perhaps another LazyCons. Now empty? would force only LazySeq thunks while first and rest would also force LazyCons. In this setting map could look like this:
(defn map [f s]
(lazy-seq
(if (empty? s) ()
(lazy-cons
(cons (f (first s)) (map f (rest s)))))))
I have also noticed that the approach taken from solution 1 onwards lends itself to further generalization. If inside the argument in the map' call in the longer nunfoldr we replace cons with concat, transpose with some implementation of Cartesian product and repeat with another recursive call, we can then create versions that "split at different places". For example, using the following as the argument we can define a nunfoldm function that "splits in the middle" and corresponds to an easy-to-imagine nfoldm. Note that all "splitting strategies" are equivalent when f is associative.
(comp
(partial map (partial apply concat))
cproduct
(fproduct (let [n-half (quot n 2)]
(list (rec n-half) (rec (- n n-half))))))
Another natural modification would allow for infinite factorizations. To achieve this, if f generated infinite factorizations, nunfoldr(f , e)(n) should generate the factorizations in an order of type ω, so that each one of them could be produced in finite time.
Other possible extensions include dropping the n parameter, creating relational folds (in correspondence with the relational unfolds we consider here) and generically handling algebraic data structures other than sequences as input/output. This book, which I have just discovered, seems to contain valuable relevant information, given in a categorical/relational language.
Finally, to be able to do this kind of programming more conveniently, we could transfer it into a point-free, algebraic setting. This would require constructing considerable "extra machinery", in fact almost making a new language. This paper demonstrates such a language.

Clojure - Using recursion to find the number of elements in a list

I have written a function that uses recursion to find the number of elements in a list and it works successfully however, I don't particularly like the way I've written it. Now I've written it one way I can't seem to think of a different way of doing it.
My code is below:
(def length
(fn [n]
(loop [i n total 0]
(cond (= 0 i) total
:t (recur (rest i)(inc total))))))
To me it seems like it is over complicated, can anyone think of another way this can be written for comparison?
Any help greatly appreciated.
Here is a naive recursive version:
(defn my-count [coll]
(if (empty? coll)
0
(inc (my-count (rest coll)))))
Bear in mind there's not going to be any tail call optimization going on here so for long lists the stack will overflow.
Here is a version using reduce:
(defn my-count [coll]
(reduce (fn [acc x] (inc acc)) 0 coll))
Here is code showing some different solutions. Normally, you should use the built-in function count.
(def data [:one :two :three])
(defn count-loop [data]
(loop [cnt 0
remaining data]
(if (empty? remaining)
cnt
(recur (inc cnt) (rest remaining)))))
(defn count-recursive [remaining]
(if (empty? remaining)
0
(inc (count-recursive (rest remaining)))))
(defn count-imperative [data]
(let [cnt (atom 0)]
(doseq [elem data]
(swap! cnt inc))
#cnt))
(deftest t-count
(is (= 3 (count data)))
(is (= 3 (count-loop data)))
(is (= 3 (count-recursive data)))
(is (= 3 (count-imperative data))))
Here's one that is tail-call optimized, and doesn't rely on loop. Basically the same as Alan Thompson's first one, but inner functions are the best things. (And feel more idiomatic to me.) :-)
(defn my-count [sq]
(letfn [(inner-count [c s]
(if (empty? s)
c
(recur (inc c) (rest s))))]
(inner-count 0 sq)))
Just for completeness, here is another twist
(defn my-count
([data]
(my-count data 0))
([data counter]
(if (empty? data)
counter
(recur (rest data) (inc counter)))))

Why is let not a valid recur target?

In clojure, this is valid:
(loop [a 5]
(if (= a 0)
"done"
(recur (dec a))))
However, this is not:
(let [a 5]
(if (= a 0)
"done"
(recur (dec a))))
So I'm wondering: why are loop and let separated, given the fact they both (at least conceptually) introduce lexical bindings? That is, why is loop a recur target while let is not?
EDIT: originally wrote "loop target" which I noticed is incorrect.
Consider the following example:
(defn pascal-step [v n]
(if (pos? n)
(let [l (concat v [0])
r (cons 0 v)]
(recur (map + l r) (dec n)))
v))
This function calculates n+mth line of pascal triangle by given mth line.
Now, imagine, that let is a recur target. In this case I won't be able to recursively call the pascal-step function itself from let binding using recur operator.
Now let's make this example a little bit more complex:
(defn pascal-line [n]
(loop [v [1]
i n]
(if (pos? i)
(let [l (concat v [0])
r (cons 0 v)]
(recur (map + l r) (dec i)))
v)))
Now we're calculating nth line of a pascal triangle. As you can see, I need both loop and let here.
This example is quite simple, so you may suggest removing let binding by using (concat v [0]) and (cons 0 v) directly, but I'm just showing you the concept. There may be a more complex examples where let inside a loop is unavoidable.

Is there an online tool to auto-indent and format Clojure code like there are many for JSON?

There are a lot of tools online that take a JSON text and show you formatted and well indented format of the same.
Some go even further and make a nice tree-like structure: http://jsonviewer.stack.hu/
Do we have something similar for Clojure code ?
Or something that can at least auto-indent it.
If the text that I have is this :
(defn prime? [n known](loop [cnt (dec (count known)) acc []](if (< cnt 0) (not (any? acc))
(recur (dec cnt) (concat acc [(zero? (mod n (nth known cnt)))])))))
It should auto-indent to something like this:
(defn prime? [n known]
(loop [cnt (dec (count known)) acc []]
(if (< cnt 0) (not (any? acc))
(recur (dec cnt) (concat acc [(zero? (mod n (nth known cnt)))])))))
Have a look at https://github.com/xsc/rewrite-clj
It is brand new and does exactly what you are asking for.
EDIT I am still getting upvotes for this. I believe I found a better solution: You can easily do this with clojure.pprint utilizing code-dispatch without using an external library.
(clojure.pprint/write '(defn prime? [n known](loop [cnt (dec (count known)) acc []](if (< cnt 0) (not (any? acc)) (recur (dec cnt) (concat acc [(zero? (mod n (nth known cnt)))])))))
:dispatch clojure.pprint/code-dispatch)
=>
(defn prime? [n known]
(loop [cnt (dec (count known)) acc []]
(if (< cnt 0)
(not (any? acc))
(recur
(dec cnt)
(concat acc [(zero? (mod n (nth known cnt)))])))))
I'm not aware of any online services which do this, but there are Clojure libraries which serve this purpose. clojure.pprint comes with Clojure (the key function is clojure.pprint/pprint); Brandon Bloom's fipp is a significantly faster alternative.
Note that neither of these is particularly likely to format code as a programmer armed with Emacs would; they're close enough to be useful, however, and for literal data (not intended to be interpreted as code) may well match human standards.
Following up on this - there is now http://pretty-print.net which will serve this very purpose for EDN and Clojure Code.
There is now https://github.com/weavejester/cljfmt for that purpose
Instructions
Add it in your Leiningen plugins:
:plugins [[lein-cljfmt "0.6.1"]]
Then, to autoformat all code in your project:
lein cljfmt fix
Sample
Your sample code will become:
(defn prime? [n known] (loop [cnt (dec (count known)) acc []] (if (< cnt 0) (not (any? acc))
(recur (dec cnt) (concat acc [(zero? (mod n (nth known cnt)))])))))
After adding some linebreaks and reformatting again:
(defn prime? [n known]
(loop [cnt (dec (count known)) acc []]
(if (< cnt 0) (not (any? acc))
(recur (dec cnt) (concat acc [(zero? (mod n (nth known cnt)))])))))

Project Euler #14 and memoization in Clojure

As a neophyte clojurian, it was recommended to me that I go through the Project Euler problems as a way to learn the language. Its definitely a great way to improve your skills and gain confidence. I just finished up my answer to problem #14. It works fine, but to get it running efficiently I had to implement some memoization. I couldn't use the prepackaged memoize function because of the way my code was structured, and I think it was a good experience to roll my own anyways. My question is if there is a good way to encapsulate my cache within the function itself, or if I have to define an external cache like I have done. Also, any tips to make my code more idiomatic would be appreciated.
(use 'clojure.test)
(def mem (atom {}))
(with-test
(defn chain-length
([x] (chain-length x x 0))
([start-val x c]
(if-let [e (last(find #mem x))]
(let [ret (+ c e)]
(swap! mem assoc start-val ret)
ret)
(if (<= x 1)
(let [ret (+ c 1)]
(swap! mem assoc start-val ret)
ret)
(if (even? x)
(recur start-val (/ x 2) (+ c 1))
(recur start-val (+ 1 (* x 3)) (+ c 1)))))))
(is (= 10 (chain-length 13))))
(with-test
(defn longest-chain
([] (longest-chain 2 0 0))
([c max start-num]
(if (>= c 1000000)
start-num
(let [l (chain-length c)]
(if (> l max)
(recur (+ 1 c) l c)
(recur (+ 1 c) max start-num))))))
(is (= 837799 (longest-chain))))
Since you want the cache to be shared between all invocations of chain-length, you would write chain-length as (let [mem (atom {})] (defn chain-length ...)) so that it would only be visible to chain-length.
In this case, since the longest chain is sufficiently small, you could define chain-length using the naive recursive method and use Clojure's builtin memoize function on that.
Here's an idiomatic(?) version using plain old memoize.
(def chain-length
(memoize
(fn [n]
(cond
(== n 1) 1
(even? n) (inc (chain-length (/ n 2)))
:else (inc (chain-length (inc (* 3 n))))))))
(defn longest-chain [start end]
(reduce (fn [x y]
(if (> (second x) (second y)) x y))
(for [n (range start (inc end))]
[n (chain-length n)])))
If you have an urge to use recur, consider map or reduce first. They often do what you want, and sometimes do it better/faster, since they take advantage of chunked seqs.
(inc x) is like (+ 1 x), but inc is about twice as fast.
You can capture the surrounding environment in a clojure :
(defn my-memoize [f]
(let [cache (atom {})]
(fn [x]
(let [cy (get #cache x)]
(if (nil? cy)
(let [fx (f x)]
(reset! cache (assoc #cache x fx)) fx) cy)))))
(defn mul2 [x] (do (print "Hello") (* 2 x)))
(def mmul2 (my-memoize mul2))
user=> (mmul2 2)
Hello4
user=> (mmul2 2)
4
You see the mul2 funciton is only called once.
So the 'cache' is captured by the clojure and can be used to store the values.