Select elements from a nested structure that match a condition in Clojure - clojure

I recently discovered the Specter library that provides data-structure navigation and transformation functions and is written in Clojure.
Implementing some of its API as a learning exercise seemed like a good idea. Specter implements an API taking a function and a nested structure as arguments and returns a vector of elements from the nested structure that satisfies the function like below:
(select (walker number?) [1 :a {:b 2}]) => [1 2]
Below is my attempt at implementing a function with similar API:
(defn select-walker [afn ds]
(vec (if (and (coll? ds) (not-empty ds))
(concat (select-walker afn (first ds))
(select-walker afn (rest ds)))
(if (afn ds) [ds]))))
(select-walker number? [1 :a {:b 2}]) => [1 2]
I have tried implementing select-walker by using list comprehension, looping, and using cons and conj. In all these cases
the return value was a nested list instead of a flat vector of elements.
Yet my implementation does not seem like idiomatic Clojure and has poor time and space complexity.
(time (dotimes [_ 1000] (select (walker number?) (range 100))))
"Elapsed time: 19.445396 msecs"
(time (dotimes [_ 1000] (select-walker number? (range 100))))
"Elapsed time: 237.000334 msecs"
Notice that my implementation is about 12 times slower than Specter's implementation.
I have three questions on the implemention of select-walker.
Is a tail-recursive implementaion of select-walker possible?
Can select-walker be written in more idiomatic Clojure?
Any hints to make select-walker execute faster?

there are at least two possibilities to make it tail recursive. First one is to process data in loop like this:
(defn select-walker-rec [afn ds]
(loop [res [] ds ds]
(cond (empty? ds) res
(coll? (first ds)) (recur res
(doall (concat (first ds)
(rest ds))))
(afn (first ds)) (recur (conj res (first ds)) (rest ds))
:else (recur res (rest ds)))))
in repl:
user> (select-walker-rec number? [1 :a {:b 2}])
[1 2]
user> user> (time (dotimes [_ 1000] (select-walker-rec number? (range 100))))
"Elapsed time: 19.428887 msecs"
(simple select-walker works about 200ms for me)
the second one (slower though, and more suitable for more difficult tasks) is to use zippers:
(require '[clojure.zip :as z])
(defn select-walker-z [afn ds]
(loop [res [] curr (z/zipper coll? seq nil ds)]
(cond (z/end? curr) res
(z/branch? curr) (recur res (z/next curr))
(afn (z/node curr)) (recur (conj res (z/node curr))
(z/next curr))
:else (recur res (z/next curr)))))
user> (time (dotimes [_ 1000] (select-walker-z number? (range 100))))
"Elapsed time: 219.015153 msecs"
this one is really slow, since zipper operates on more complex structures. It's great power brings unneeded overhead to this simple task.
the most idiomatic approach i guess, is to use tree-seq:
(defn select-walker-t [afn ds]
(filter #(and (not (coll? %)) (afn %))
(tree-seq coll? seq ds)))
user> (time (dotimes [_ 1000] (select-walker-t number? (range 100))))
"Elapsed time: 1.320209 msecs"
it is incredibly fast, as it produces a lazy sequence of results. In fact you should realize its data for the fair test:
user> (time (dotimes [_ 1000] (doall (select-walker-t number? (range 100)))))
"Elapsed time: 53.641014 msecs"
one more thing to notice about this variant, is that it's not tail recursive, so it would fail in case of really deeply nested structures (maybe i'm mistaken, but i guess it's about couple of thousands levels of nesting), still it's suitable for the most cases.

Related

Create transducer from ordinary code

How would I create a transducer from the following ordinary code, where combo is the alias for clojure.math.combinatorics:
(defn row->evenly-divided [xs]
(->> (combo/combinations (sort-by - xs) 2)
(some (fn [[big small]]
(assert (>= big small))
(let [res (/ big small)]
(when (int? res)
res))))))
As noted in a comment transducers are only applicable for processing each item. With this is mind I've made the code a little more transducer friendly by shifting the sorting so that it is now being done for each item. I don't think there's anything that can be done about the combinations part however!
(defn row->evenly-divided [xs]
(->> (combo/combinations xs 2)
(some (fn [xy]
(let [res (apply / (sort-by - xy))]
(when (int? res)
res))))))
This is the same function but with an introduced transducer:
(def x-row->evenly-divided (comp
(map (partial sort-by -))
(map (partial apply /))
(filter int?)))
(defn row->evenly-divided-2 [xs]
(->> (combo/combinations xs 2)
(sequence x-row->evenly-divided)
first))

Clojure butlast vs drop-last

What is the difference between butlast and drop-last in Clojure ?
Is it only the laziness ? Should I should prefer one over the other ?
also, if you need to realize the whole collection, butlast is dramatically faster, which is logical if you look at their source:
(def
butlast (fn ^:static butlast [s]
(loop [ret [] s s]
(if (next s)
(recur (conj ret (first s)) (next s))
(seq ret)))))
(defn drop-last
([s] (drop-last 1 s))
([n s] (map (fn [x _] x) s (drop n s))))
so drop-last uses map, while butlast uses simple iteration with recur. Here is a little example:
user> (time (let [_ (butlast (range 10000000))]))
"Elapsed time: 2052.853726 msecs"
nil
user> (time (let [_ (doall (drop-last (range 10000000)))]))
"Elapsed time: 14072.259077 msecs"
nil
so i wouldn't blindly prefer one over another. I use drop-last only when i really need laziness, otherwise butlast.
Yes, laziness as well as the fact that drop-last can take also take n, indicating how many elements to drop from the end lazily.
There's a discussion here where someone is making the case that butlast is more readable and maybe a familiar idiom for Lisp programmers, but I generally just opt to use drop-last.

Futures somehow slower then agents?

The following code does essentially just let you execute something like (function (range n)) in parallel.
(experiment-with-agents 10000 10 #(filter prime? %))
This for example finds the prime numbers between 0 and 10000 with 10 agents.
(experiment-with-futures 10000 10 #(filter prime? %))
Same just with futures.
Now the problem is that the solution with futures doesn't run faster with more futures. Example:
; Futures
(time (experiment-with-futures 10000 1 #(filter prime? %)))
"Elapsed time: 33417.524634 msecs"
(time (experiment-with-futures 10000 10 #(filter prime? %)))
"Elapsed time: 33891.495702 msecs"
; Agents
(time (experiment-with-agents 10000 1 #(filter prime? %)))
"Elapsed time: 33048.80492 msecs"
(time (experiment-with-agents 10000 10 #(filter prime? %)))
"Elapsed time: 9211.864133 msecs"
Why? Did I do something wrong (probably, new to Clojure and just playing around with stuff^^)? Because I thought that futures are actually prefered in that scenario.
Source:
(defn setup-agents
[coll-size num-agents]
(let [step (/ coll-size num-agents)
parts (partition step (range coll-size))
agents (for [_ (range num-agents)] (agent []) )
vect (map #(into [] [%1 %2]) agents parts)]
(vec vect)))
(defn start-agents
[coll f]
(for [[agent part] coll] (send agent into (f part))))
(defn results
[agents]
(apply await agents)
(vec (flatten (map deref agents))))
(defn experiment-with-agents
[coll-size num-agents f]
(-> (setup-agents coll-size num-agents)
(start-agents f)
(results)))
(defn experiment-with-futures
[coll-size num-futures f]
(let [step (/ coll-size num-futures)
parts (partition step (range coll-size))
futures (for [index (range num-futures)] (future (f (nth parts index))))]
(vec (flatten (map deref futures)))))
You're getting tripped up by the fact that for produces a lazy sequence inside of experiment-with-futures. In particular, this piece of code:
(for [index (range num-futures)] (future (f (nth parts index))))
does not immediately create all of the futures; it returns a lazy sequence that will not create the futures until the contents of the sequence are realized. The code that realizes the lazy sequence is:
(vec (flatten (map deref futures)))
Here, map returns a lazy sequence of the dereferenced future results, backed by the lazy sequence of futures. As vec consumes results from the sequence produced by map, each new future is not submitted for processing until the previous one completes.
To get parallel processing, you need to not create the futures lazily. Try wrapping the for loop where you create the futures inside a doall.
The reason you're seeing an improvement with agents is the call to (apply await agents) immediately before you gather the agent results. Your start-agents function also returns a lazy sequence and does not actually dispatch the agent actions. An implementation detail of apply is that it completely realizes small sequences (under 20 items or so) passed to it. A side effect of passing agents to apply is that the sequence is realized and all agent actions are dispatched before it is handed off to await.

Computing folder size

I'm trying to compute folder size in parallel.
Maybe it's naive approach.
What I do, is that I give computation of every branch node (directory) to an agent.
All leaf nodes have their file sizes added to my-size.
Well it doesn't work. :)
'scan' works ok, serially.
'pscan' prints only files from first level.
(def agents (atom []))
(def my-size (atom 0))
(def root-dir (clojure.java.io/file "/"))
(defn scan [listing]
(doseq [f listing]
(if (.isDirectory f)
(scan (.listFiles f))
(swap! my-size #(+ % (.length f))))))
(defn pscan [listing]
(doseq [f listing]
(if (.isDirectory f)
(let [a (agent (.listFiles f))]
(do (swap! agents #(conj % a))
(send-off a pscan)
(println (.getName f))))
(swap! my-size #(+ % (.length f))))))
Do you have any idea, what have i done wrong?
Thanks.
No need to keep state using atoms. Pure functional:
(defn psize [f]
(if (.isDirectory f)
(apply + (pmap psize (.listFiles f)))
(.length f)))
So counting filesizes in parallel should be so easy?
It's not :)
I tried to solve this issue better. I realized that i'm doing blocking I/O operations so pmap doesn't do the job.
I was thinking maybe giving chunks of directories (branches) to agents to process it independently would make sense. Looks it does :)
Well I haven't benchmarked it yet.
It works, but, there might be some problems with symbolic links on UNIX-like systems.
(def user-dir (clojure.java.io/file "/home/janko/projects/"))
(def root-dir (clojure.java.io/file "/"))
(def run? (atom true))
(def *max-queue-length* 1024)
(def *max-wait-time* 1000) ;; wait max 1 second then process anything left
(def *chunk-size* 64)
(def queue (java.util.concurrent.LinkedBlockingQueue. *max-queue-length* ))
(def agents (atom []))
(def size-total (atom 0))
(def a (agent []))
(defn branch-producer [node]
(if #run?
(doseq [f node]
(when (.isDirectory f)
(do (.put queue f)
(branch-producer (.listFiles f)))))))
(defn producer [node]
(future
(branch-producer node)))
(defn node-consumer [node]
(if (.isFile node)
(.length node)
0))
(defn chunk-length []
(min (.size queue) *chunk-size*))
(defn compute-sizes [a]
(doseq [i (map (fn [f] (.listFiles f)) a)]
(swap! size-total #(+ % (apply + (map node-consumer i))))))
(defn consumer []
(future
(while #run?
(when-let [size (if (zero? (chunk-length))
false
(chunk-length))] ;appropriate size of work
(binding [a (agent [])]
(dotimes [_ size] ;give us all directories to process
(when-let [item (.poll queue)]
(set! a (agent (conj #a item)))))
(swap! agents #(conj % a))
(send-off a compute-sizes))
(Thread/sleep *max-wait-time*)))))
You can start it by typing
(producer (list user-dir))
(consumer)
For result type
#size-total
You can stop it by (there are running futures - correct me if I'm wrong)
(swap! run? not)
If you find any errors/mistakes, you're welcome to share your ideas!

How do I write a predicate that checks if a value exists in an infinite seq?

I had an idea for a higher-order function today that I'm not sure how to write. I have several sparse, lazy infinite sequences, and I want to create an abstraction that lets me check to see if a given number is in any of these lazy sequences. To improve performance, I wanted to push the values of the sparse sequence into a hashmap (or set), dynamically increasing the number of values in the hashmap whenever it is necessary. Automatic memoization is not the answer here due to sparsity of the lazy seqs.
Probably code is easiest to understand, so here's what I have so far. How do I change the following code so that the predicate uses a closed-over hashmap, but if needed increases the size of the hashmap and redefines itself to use the new hashmap?
(defn make-lazy-predicate [coll]
"Returns a predicate that returns true or false if a number is in
coll. Coll must be an ordered, increasing lazy seq of numbers."
(let [in-lazy-list? (fn [n coll top cache]
(if (> top n)
(not (nil? (cache n)))
(recur n (next coll) (first coll)
(conj cache (first coll)))]
(fn [n] (in-lazy-list? n coll (first coll) (sorted-set)))))
(def my-lazy-list (iterate #(+ % 100) 1))
(let [in-my-list? (make-lazy-predicate my-lazy-list)]
(doall (filter in-my-list? (range 10000))))
How do I solve this problem without reverting to an imperative style?
This is a thread-safe variant of Adam's solution.
(defn make-lazy-predicate
[coll]
(let [state (atom {:mem #{} :unknown coll})
update-state (fn [{:keys [mem unknown] :as state} item]
(let [[just-checked remainder]
(split-with #(<= % item) unknown)]
(if (seq just-checked)
(-> state
(assoc :mem (apply conj mem just-checked))
(assoc :unknown remainder))
state)))]
(fn [item]
(get-in (if (< item (first (:unknown #state)))
#state
(swap! state update-state item))
[:mem item]))))
One could also consider using refs, but than your predicate search might get rolled back by an enclosing transaction. This might or might not be what you want.
This function is based on the idea how the core memoize function works. Only numbers already consumed from the lazy list are cached in a set. It uses the built-in take-while instead of doing the search manually.
(defn make-lazy-predicate [coll]
(let [mem (atom #{})
unknown (atom coll)]
(fn [item]
(if (< item (first #unknown))
(#mem item)
(let [just-checked (take-while #(>= item %) #unknown)]
(swap! mem #(apply conj % just-checked))
(swap! unknown #(drop (count just-checked) %))
(= item (last just-checked)))))))