Computing folder size - clojure

I'm trying to compute folder size in parallel.
Maybe it's a naive approach.
What I do is hand the computation of every branch node (directory) to an agent.
All leaf nodes (files) have their sizes added to my-size.
Well, it doesn't work. :)
'scan' works fine, serially.
'pscan' prints only files from the first level.
(def agents (atom []))
(def my-size (atom 0))
(def root-dir (clojure.java.io/file "/"))

(defn scan [listing]
  (doseq [f listing]
    (if (.isDirectory f)
      (scan (.listFiles f))
      (swap! my-size #(+ % (.length f))))))

(defn pscan [listing]
  (doseq [f listing]
    (if (.isDirectory f)
      (let [a (agent (.listFiles f))]
        (do (swap! agents #(conj % a))
            (send-off a pscan)
            (println (.getName f))))
      (swap! my-size #(+ % (.length f))))))
Do you have any idea what I've done wrong?
Thanks.

No need to keep state using atoms. Pure functional:
(defn psize [f]
  (if (.isDirectory f)
    (apply + (pmap psize (.listFiles f)))
    (.length f)))
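For example (root-dir is defined in the question; note that .listFiles returns nil for directories you can't read, which this sketch doesn't guard against):

(psize root-dir) ;=> total size in bytes of everything under "/"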

So counting file sizes in parallel should be that easy?
It's not. :)
I tried to solve this issue better. I realized that I'm doing blocking I/O operations, so pmap doesn't do the job.
I was thinking that handing chunks of directories (branches) to agents to process independently would make sense. It looks like it does. :)
I haven't benchmarked it yet, though.
It works, but there might be some problems with symbolic links on UNIX-like systems.
(def user-dir (clojure.java.io/file "/home/janko/projects/"))
(def root-dir (clojure.java.io/file "/"))
(def run? (atom true))
(def *max-queue-length* 1024)
(def *max-wait-time* 1000) ;; wait max 1 second, then process anything left
(def *chunk-size* 64)
(def queue (java.util.concurrent.LinkedBlockingQueue. *max-queue-length*))
(def agents (atom []))
(def size-total (atom 0))
(def ^:dynamic a (agent [])) ;; dynamic so the consumer can rebind it per chunk
(defn branch-producer [node]
  (if @run?
    (doseq [f node]
      (when (.isDirectory f)
        (do (.put queue f)
            (branch-producer (.listFiles f)))))))

(defn producer [node]
  (future
    (branch-producer node)))

(defn node-consumer [node]
  (if (.isFile node)
    (.length node)
    0))

(defn chunk-length []
  (min (.size queue) *chunk-size*))

(defn compute-sizes [a]
  (doseq [i (map (fn [f] (.listFiles f)) a)]
    (swap! size-total #(+ % (apply + (map node-consumer i))))))
(defn consumer []
  (future
    (while @run?
      (when-let [size (if (zero? (chunk-length))
                        false
                        (chunk-length))] ;appropriate size of work
        (binding [a (agent [])]
          (dotimes [_ size] ;give us all directories to process
            (when-let [item (.poll queue)]
              (set! a (agent (conj @a item)))))
          (swap! agents #(conj % a))
          (send-off a compute-sizes)))
      (Thread/sleep *max-wait-time*))))
You can start it by typing
(producer (list user-dir))
(consumer)
For the result, type
@size-total
You can stop it with (there are still running futures; correct me if I'm wrong):
(swap! run? not)
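To make sure every dispatched computation has finished before reading the total, awaiting the collected agents should work (a sketch, assuming the agents atom still holds everything created so far):

(apply await @agents) ; blocks until each agent has processed the actions sent to it
@size-total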
If you find any errors/mistakes, you're welcome to share your ideas!

Related

Create transducer from ordinary code

How would I create a transducer from the following ordinary code, where combo is the alias for clojure.math.combinatorics:
(defn row->evenly-divided [xs]
  (->> (combo/combinations (sort-by - xs) 2)
       (some (fn [[big small]]
               (assert (>= big small))
               (let [res (/ big small)]
                 (when (int? res)
                   res))))))
As noted in a comment, transducers are only applicable for processing each item. With this in mind, I've made the code a little more transducer-friendly by shifting the sorting so that it is now done for each item. I don't think there's anything that can be done about the combinations part, however!
(defn row->evenly-divided [xs]
  (->> (combo/combinations xs 2)
       (some (fn [xy]
               (let [res (apply / (sort-by - xy))]
                 (when (int? res)
                   res))))))
This is the same function but with an introduced transducer:
(def x-row->evenly-divided
  (comp
    (map (partial sort-by -))
    (map (partial apply /))
    (filter int?)))

(defn row->evenly-divided-2 [xs]
  (->> (combo/combinations xs 2)
       (sequence x-row->evenly-divided)
       first))
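A quick REPL check (assuming the usual [clojure.math.combinatorics :as combo] require; the sample row here is mine):

(row->evenly-divided-2 [9 3 7 5]) ;=> 3, from the pair [9 3]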

Read each entry lazily from a zip file

I want to read file entries in a zip file into a sequence of strings if possible. Currently I'm doing something like this to print out directory names for example:
(import '(java.io FileInputStream)
        '(java.util.zip ZipInputStream))

(defn entries [zipfile]
  (lazy-seq
    (if-let [entry (.getNextEntry zipfile)]
      (cons entry (entries zipfile)))))

(defn with-each-entry [fileName f]
  (with-open [z (ZipInputStream. (FileInputStream. fileName))]
    (doseq [e (entries z)]
      ; (println (.getName e))
      (f e)
      (.closeEntry z))))

(with-each-entry "tmp/my.zip"
  (fn [e] (if (.isDirectory e)
            (println (.getName e)))))
However, this will iterate through the entire zip file. How could I change this so I could take just the first few entries, say something like:

(take 10 (zip-entries "tmp/my.zip"
           (fn [e] (if (.isDirectory e)
                     (println (.getName e))))))
This seems like a pretty natural fit for the new transducers in CLJ 1.7.
You just build up the transformations you want as a transducer using comp and the usual seq-transforming fns with no seq/collection argument. In your example cases,
(comp (map #(.getName %)) (take 10)) and
(comp (filter #(.isDirectory %)) (map #(-> % .getName println))).
This returns a function of multiple arities which you can use in a lot of ways. In this case you want to eagerly reduce it over the entries sequence (to ensure realization of the entries happens inside with-open), so you use transduce (example zip data made by zipping one of my clojure project folders):
(with-open [z (-> "training-day.zip" FileInputStream. ZipInputStream.)]
  (let [transform (comp (map #(.getName %)) (take 10))]
    (transduce transform conj (entries z))))
;; return value: [".gitignore" ".lein-failures" ".midje-grading-config.clj" ".nrepl-port" ".travis.yml" "project.clj" "README.md" "target/" "target/classes/" "target/repl-port"]
Here I'm transducing with base function conj which makes a vector of the names. If you instead want your transducer to perform side-effects and not return a value, you can do that with a base function like (constantly nil):
(with-open [z (-> "training-day.zip" FileInputStream. ZipInputStream.)]
  (let [transform (comp (filter #(.isDirectory %)) (map #(-> % .getName println)))]
    (transduce transform (constantly nil) (entries z))))
which gives output:
target/
target/classes/
target/stale/
test/
A potential downside with this is that you'll probably have to manually incorporate .closeEntry calls into each transducer you use here to prevent holding those resources, because you can't in the general case know when each transducer is done reading the entry.
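For instance, one way to do that (a sketch, assuming you only need each entry's name and not its stream) is to close each entry inside the same map step that reads it:

(with-open [z (-> "training-day.zip" FileInputStream. ZipInputStream.)]
  (let [transform (comp (map (fn [e]
                               (let [n (.getName e)]
                                 (.closeEntry z) ; done with this entry's data
                                 n)))
                        (take 10))]
    (transduce transform conj (entries z))))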

Optimal way to iterate over a core.async channel for printing?

I'm trying to dump the results of a core.async channel to stdout.
Here is what I have (simplified example):
(use 'clojure.core.async)

(def mychan (to-chan (range 100)))

(loop []
  (let [a (<!! mychan)]
    (if (not (nil? a))
      (do
        (println a)
        (recur)))))
Now I think I should be able to replace this with map:
(map (fn [a] (println a) a) [mychan])
But this seems to be lazy and doesn't return any results.
I can't help but feel that my loop function is somewhat of a workaround. My question is: what is the optimal way to iterate over a core.async channel for printing?
Better to use go-loop to run the logging process in a go block:
(def c (async/chan 10))

(go-loop []
  (when-let [msg (<! c)]
    (println msg)
    (recur)))

(async/onto-chan c (range 100))
Or, on 1.7, you can use a transducer:
(def prn-chan (async/chan 10 (map println)))
(async/onto-chan prn-chan (range 100))
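If you need the calling thread to block until everything has been printed: onto-chan closes the channel once the range is exhausted, and the channel returned by go-loop closes when the loop exits, so something like this should work (the printer name is just for illustration):

(def c (async/chan 10))
(def printer
  (go-loop []
    (when-let [msg (<! c)]
      (println msg)
      (recur))))
(async/onto-chan c (range 100))
(<!! printer) ; returns once c is closed and fully drained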

Clojure, Agents, lack of side-effects

I'm using agents to manipulate a structure, but I don't see all my side effects.
All the messages are sent (I've printed and counted them), but there are times when I don't get all my side effects. It's as if not all of my functions were applied to the agent's state, or as if the last send were applied to a previous state.
I experimented with doall and dorun but haven't found a solution; I'd appreciate any help.
;; aux function for adding an element to a hashmap
(defn extend-regs [reg s o]
  (let [os (get reg s)]
    (if (nil? os)
      (assoc reg s [o])
      (assoc reg s (conj os o)))))

;; the agent's altering function - adding an element to the :regs field (a hashmap)
(defn add-reg! [d s o]
  (send d (fn [a] (assoc a :regs (extend-regs (:regs a) s o)))))

;; Creating the agents, dct/init returns an agent
;; pds: data for fields
(defn pdcts->init-dcts! [pds]
  (doall (map dct/init (map :nam pds) (repeat nil))))

;; Altering one agent's state, dct/add-reg sends an assoc message to the agent
;; d: agent, pd: data for fields
(defn dct->add-regs! [d pd]
  (dorun (map (fn [s r] (dct/add-reg! d s r))
              (:syms pd)
              (:regs pd)))
  d)

;; Going through all agents
;; ds: agents, pds: datas
(defn dcts->add-regs! [ds pds]
  (dorun (map (fn [d pd] (dct->add-regs! d pd))
              ds
              pds))
  ds)
EDIT: =====================================================
Okay, it turned out I just hadn't waited long enough for my threads to finish their tasks. Now the question is how to monitor my agents. How can I know that there are unfinished actions in the queue? I've only found swank.core/active-threads and similar, but they are not a solution.
I do not have a solution for your problem, but I can't resist suggesting some improvements to the first two functions:
(defn extend-regs [reg s o]
  (let [os (get reg s)]
    (if (nil? os)
      (assoc reg s [o])
      (assoc reg s (conj os o)))))

;; => place the 'if inside the assoc:
(defn extend-regs [reg s o]
  (let [os (get reg s)]
    (assoc reg s (if (nil? os) [o] (conj os o)))))

;; => this (if (nil? x) ...) is the pattern of the function fnil, so ...
(defn extend-regs [reg s o]
  (let [os (get reg s)]
    (assoc reg s ((fnil conj []) os o))))

;; with update-in, this is even clearer, and we can remove the let altogether:
(defn extend-regs [reg s o]
  (update-in reg [s] (fnil conj []) o))
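For reference, (fnil conj []) simply substitutes [] when its first argument is nil:

((fnil conj []) nil :a)  ;=> [:a]
((fnil conj []) [:a] :b) ;=> [:a :b]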
As for the second:
(defn add-reg! [d s o]
  (send d (fn [a] (assoc a :regs (extend-regs (:regs a) s o)))))

;; => We can, again, use update-in instead of assoc:
(defn add-reg! [d s o]
  (send d (fn [a] (update-in a [:regs] extend-regs s o))))

;; or, if you can get rid of extend-regs:
(defn add-reg! [d s o]
  (send d (fn [a] (update-in a [:regs s] (fnil conj []) o))))
Finally, as a matter of style, I would place add-reg in a separate function, and directly use the idiom of sending to the agent in the client code (or have a simplified add-reg! function):
(defn add-reg [v s o] (update-in v [:regs s] (fnil conj []) o))
(defn add-reg! [d s o] (send d add-reg s o))
I know this doesn't answer your initial question, but it was fun to write this step-by-step refactoring.
Use await or await-for to wait until an agent has finished its current work queue:
(await agent1 agent2 agent3)
or
(apply await list-of-agents)
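await-for additionally takes a timeout in milliseconds and returns logical false if it expires before the agents are done, so you can poll for stragglers (the 10000 here is just an example value):

(apply await-for 10000 list-of-agents) ; true when everything finished within 10 s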
A minor improvement of add-reg:
(defn extend-regs [reg s o]
  (update-in reg [s] conj o))
This works because of
(conj nil :b) ; => [:b]
thus
(update-in {} [:a] conj :b) ; => {:a [:b]}
finally we have:
(defn add-reg! [d s o]
  (send d update-in [:regs s] conj o))

Let Over Lambda block-scanner in Clojure

I have just started reading Let Over Lambda and I thought I would try to write a Clojure version of the block-scanner from the closures chapter.
I have the following so far:
(defn block-scanner [trigger-string]
  (let [curr (ref trigger-string) trig trigger-string]
    (fn [data]
      (doseq [c data]
        (if (not (empty? @curr))
          (dosync (ref-set curr
                    (if (= (first @curr) c)
                      (rest @curr)
                      trig)))))
      (empty? @curr))))

(def sc (block-scanner "jihad"))
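A quick check at the REPL of what I expect it to do (return true once the whole trigger string has been seen across calls, and stay true afterwards):

(sc "We will start ji")  ;=> false
(sc "had tomorrow")      ;=> true
(sc "anything else")     ;=> true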
This works, I think, but I would like to know what I did right and what I could do better.
I would not use ref-set but alter because you don't reset the state to a completely new value, but update it to a new value which is obtained from the old one.
(defn block-scanner
  [trigger-string]
  (let [curr (ref trigger-string)
        trig trigger-string]
    (fn [data]
      (doseq [c data]
        (when (seq @curr)
          (dosync
            (alter curr
                   #(if (-> % first (= c))
                      (rest %)
                      trig))))))
      (empty? @curr))))
Then it is not necessary to use refs since you don't have to coordinate changes. Here an atom is a better fit, since it can be changed without all the STM ceremony.
(defn block-scanner
  [trigger-string]
  (let [curr (atom trigger-string)
        trig trigger-string]
    (fn [data]
      (doseq [c data]
        (when (seq @curr)
          (swap! curr
                 #(if (-> % first (= c))
                    (rest %)
                    trig)))))
      (empty? @curr))))
Next I would get rid of the imperative style.
It does more than it should: it traverses all of data even if we have already found a match. We should stop early.
It is not thread-safe, since we access the atom multiple times and it might change in between, so we should touch the atom only once. (This is probably not critical in this case, but it's a good habit.)
It's ugly. We can do all the work functionally and just save the state when we arrive at a result.
(defn block-scanner
  [trigger-string]
  (let [state (atom trigger-string)
        advance (fn [trigger d]
                  (when trigger
                    (condp = d
                      (first trigger) (next trigger)
                      ; This is maybe a bug in the book. The book code
                      ; matches "foojihad", but not "jijihad".
                      (first trigger-string) (next trigger-string)
                      trigger-string)))
        update (fn [trigger data]
                 (if-let [data (seq data)]
                   (when-let [trigger (advance trigger (first data))]
                     (recur trigger (rest data)))
                   trigger))]
    (fn [data]
      (nil? (swap! state update data)))))