I have an input csv file and need to generate an output file that has one line for each input line. Each input line could be of a specific type (say "old" or "new") that can be determined only by processing the input line.
In addition to generating the output file, we also want to print a summary of how many lines of each type were in the input file. My actual task involves generating different SQLs based on the input line type, but to keep the example code focused, I have kept the processing in the function proc-line simple. The function func determines what type an input line is -- again, I have kept it simple by randomly generating a type. The actual logic is more involved.
I have the following code and it does the job. However, to retain a functional style for the task of generating the summary, I chose to return a keyword to signify the type of each line and created a lazy sequence of these for generating the final summary. In an imperative style, we would simply increment a count for each line type. Generating a potentially large collection just for summarizing seems inefficient. Another consequence of the way I have coded it is the repetition of the (.write writer ...) portion. Ideally, I would code that just once.
Any suggestions for eliminating the two problems I have identified (and others)?
(ns file-proc.core
  (:gen-class)
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]))

(defn func [x]
  (rand-nth [true false]))

(defn proc-line [line writer]
  (if (func line)
    (do (.write writer (str line "\n")) :new)
    (do (.write writer (str (reverse line) "\n")) :old)))

(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (->> (csv/read-csv reader)
         (rest)
         (map #(proc-line % writer))
         (frequencies)
         (doall))))
I'd try to separate data processing from side effects like reading and writing files. Hopefully this lets the IO operations stay at opposite boundaries of the pipeline, so the "middle" processing logic is agnostic of where the input comes from and where the output is going.
(defn rand-bool [] (rand-nth [true false]))

(defn proc-line [line]
  (if (rand-bool)
    [line :new]
    [(reverse line) :old]))
proc-line no longer takes a writer; it only cares about the line, and it returns a vector/2-tuple of the processed line along with a keyword. It doesn't concern itself with string formatting either—we should let csv/write-csv do that. Now you could do something like this:
(defn process-lines [reader]
  (->> (csv/read-csv reader)
       (rest)
       (map proc-line)))

(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (let [lines (process-lines reader)]
      (csv/write-csv writer (map first lines))
      (frequencies (map second lines)))))
This will work but it's going to realize/keep the entire input sequence in memory, which you don't want for large files. We need a way to keep this pipeline lazy/efficient, but we also have to produce two "streams" from one and in a single pass: the processed lines only to be sent to write-csv, and each line's metadata for calculating frequencies. One "easy" way to do this is to introduce some mutability to track the metadata frequencies as the lazy sequence is consumed by write-csv:
(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (let [freqs (atom {})]
      (->> (csv/read-csv reader)
           ;; processing starts
           (rest)
           (map (fn [line]
                  (let [[row tag] (proc-line line)]
                    (swap! freqs update tag (fnil inc 0))
                    row)))
           ;; processing ends
           (csv/write-csv writer))
      @freqs)))
I removed the process-lines call to make the full pipeline more apparent. By the time write-csv has fully (and lazily) consumed its payload, the dereferenced atom @freqs will be a map like {:old 23, :new 31}, which becomes the return value of generate-report. There's room for improvement/generalization, but I think this is a start.
As others have mentioned, separating writing and processing work would be ideal. Here's how I usually do this:
(defn product-type [p]
  (rand-nth [:new :old]))

(defn row->product [row]
  (let [p (zipmap [:id :name :price] row)]
    (assoc p :type (product-type p))))

(defmulti to-csv :type)
(defmethod to-csv :new [product] ...)
(defmethod to-csv :old [product] ...)

(defn generate-report [from to]
  (with-open [rdr (io/reader from)
              wrtr (io/writer to)]
    (->> (rest (csv/read-csv rdr))
         (map row->product)
         (map #(do (.write wrtr (to-csv %)) %))
         (map :type)
         (frequencies)
         (doall))))
(The code might not work—didn't run it, sorry.)
Constructing a hash-map and using multimethods is optional, of course, but it's better to assign a product its type first. This way the data dictates what the pipeline is doing, not proc-line.
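For illustration only, the elided to-csv methods might look something like this (hypothetical bodies; the real formatting depends on the SQL/CSV you need to emit):

;; hypothetical bodies for the elided methods above; the real ones
;; depend on what each line type should look like in the output file
(defmethod to-csv :new [product]
  (str (:id product) "," (:name product) "," (:price product) "\n"))
(defmethod to-csv :old [product]
  (str (:price product) "," (:name product) "," (:id product) "\n"))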
To refactor the code we need the safety net of at least one characterization test for generate-report. Since that function does file I/O (we will make the code independent of I/O later), we will use this sample CSV file, f1.csv:
Year,Code
1997,A
2000,B
2010,C
1996,D
2001,E
We cannot yet write a test because function func uses a RNG, so we rewrite it to be deterministic by actually looking at the input. While we're at it, we rename it to new?, which is more representative of the problem:
(defn new? [row]
  (>= (Integer/parseInt (first row)) 2000))
where, for the sake of the exercise, we assume that a row is "new" if the Year column is >= 2000.
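A quick REPL check against the sample rows:

(new? ["2010" "C"]) ;=> true
(new? ["1997" "A"]) ;=> false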
We can now write the test and see it pass (here for brevity we focus only on the frequency calculation, not on the output transformation):
(deftest characterization-as-posted
  (is (= {:old 2, :new 3}
         (generate-report "f1.csv" "f1.tmp"))))
And now to the refactoring. The main idea is to recognize that we need an accumulator, replacing map with reduce and getting rid of frequencies and doall. We also rename "line" to "row", since that is what a line is called in the CSV format:
(defn generate-report [from to]                     ; 1
  (let [[old new _]                                 ; 2
        (with-open [reader (io/reader from)         ; 3
                    writer (io/writer to)]          ; 4
          (->> (csv/read-csv reader)                ; 5
               (rest)                               ; 6
               (reduce process-row [0 0 writer])))] ; 7
    {:old old :new new}))                           ; 8
The new process-row (originally proc-line) becomes:
(defn process-row [[old new writer] row]
  (if (new? row)
    (do (.write writer (str row "\n")) [old (inc new) writer])
    (do (.write writer (str (reverse row) "\n")) [(inc old) new writer])))
Function process-row, like any function passed to reduce, takes two arguments: the first, [old new writer], is the accumulator, a (destructured) vector of the two counters plus the I/O writer; the second, row, is one element of the collection being reduced. It returns the new accumulator vector, which at the end of the collection is destructured on line 2 of generate-report and used on line 8 to build a hash map equivalent to the one previously returned by frequencies.
We can do one last refactoring: separate the file I/O from the business logic, so that we can write tests without the scaffolding of prepared input files, as follows.
Function process-row becomes:
(defn process-row [[old-cnt new-cnt writer] row]
  (let [[out-row old new] (process-row-pure old-cnt new-cnt row)]
    (.write writer out-row)
    [old new writer]))
and the business logic can be done by the pure (and so easily testable) function:
(defn process-row-pure [old new row]
  (if (new? row)
    [(str row "\n") old (inc new)]
    [(str (reverse row) "\n") (inc old) new]))
All this without mutating anything.
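As an illustration, the pure function can now be tested directly, with no temp files (a minimal sketch, assuming the new? predicate defined earlier):

(deftest process-row-pure-counts
  ;; a "new" row increments the new counter...
  (let [[_ old new] (process-row-pure 0 0 ["2001" "E"])]
    (is (= [0 1] [old new])))
  ;; ...and an "old" row increments the old counter
  (let [[_ old new] (process-row-pure 0 0 ["1996" "D"])]
    (is (= [1 0] [old new]))))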
IMHO it's best to separate the two different aspects: counting the frequencies and writing to a file:
(defn count-lines
  ([lines] (count-lines lines 0 0))
  ([lines count-old count-new]
   (if-let [line (first lines)]
     (if (func line)
       (recur (rest lines) count-old (inc count-new))
       (recur (rest lines) (inc count-old) count-new))
     {:new count-new :old count-old})))
(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (let [lines (rest (csv/read-csv reader))
          freqs (count-lines lines)]
      (doseq [line lines]
        (.write writer (str line "\n")))
      freqs)))
Suppose I have a very simple .clj file on disk with the following content:
(def a 2)
(def b 3)
(defn add-two [x y] (+ x y))
(println (add-two a b))
From the context of a separate program, I would like to read the above program in as a list of S-expressions: '((def a 2) (def b 3) ... (println (add-two a b))).
I imagine that one way of doing this involves 1. using slurp on (io/file file-name.clj) to produce a string containing the file's contents, 2. passing that string to a parser for Clojure code, and 3. pouring the sequence produced by the parser into a list (i.e., (into '() parsed-code)).
However, this approach seems sort of clumsy and error-prone. Does anyone know of a more elegant and/or idiomatic way to read a Clojure file as a list of S-expressions?
Update: Following up on feedback from the comments section, I've decided to try the approach I mentioned on an actual source file using aphyr's clj-antlr as follows:
tcl.core=> (def file-as-string (slurp (clojure.java.io/file "src/tcl/core.clj")))
tcl.core=> (pprint (antlr/parser "src/grammars/Clojure.g4" file-as-string))
{:parser
{:local
#object[java.lang.ThreadLocal 0x5bfcab6 "java.lang.ThreadLocal#5bfcab6"],
:grammar
#object[org.antlr.v4.tool.Grammar 0x5b8cfcb9 "org.antlr.v4.tool.Grammar#5b8cfcb9"]},
:opts
"(ns tcl.core\n (:gen-class)\n (:require [clj-antlr.core :as antlr]))\n\n(def foo 42)\n\n(defn parse-program\n \"uses antlr grammar to \"\n [program]\n ((antlr/parser \"src/grammars/Clojure.g4\") program))\n\n\n(defn -main\n \"I don't do a whole lot ... yet.\"\n [& args]\n (println \"tlc is tcl\"))\n"}
nil
Does anyone know how to transform this output to a list of S-Expressions as originally intended? That is, how might one go about squeezing valid Clojure code/data from the result of parsing with clj-antlr?
(import '[java.io PushbackReader])
(require '[clojure.java.io :as io])
(require '[clojure.edn :as edn])

;; adapted from: http://stackoverflow.com/a/24922859/6264
(defn read-forms [file]
  (with-open [rdr (-> file io/file io/reader PushbackReader.)]
    (let [sentinel (Object.)]
      (loop [forms []]
        (let [form (edn/read {:eof sentinel} rdr)]
          (if (= sentinel form)
            forms
            (recur (conj forms form))))))))
(comment
  (spit "/tmp/example.clj"
        "(def a 2)
(def b 3)
(defn add-two [x y] (+ x y))
(println (add-two a b))")

  (read-forms "/tmp/example.clj")
  ;;=> [(def a 2) (def b 3) (defn add-two [x y] (+ x y)) (println (add-two a b))]
  )
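Note that clojure.edn/read only understands edn, so a source file using reader macros such as #(...) will fail to parse. A sketch of a variant using clojure.core/read instead, with *read-eval* bound to false to block #=(...) evaluation (read-clj-forms is a made-up name):

(defn read-clj-forms [file]
  ;; clojure.core/read handles the full Clojure reader syntax;
  ;; binding *read-eval* to false prevents #=(...) forms from evaluating
  (with-open [rdr (java.io.PushbackReader. (io/reader (io/file file)))]
    (binding [*read-eval* false]
      (let [sentinel (Object.)]
        (loop [forms []]
          (let [form (read rdr false sentinel)]
            (if (identical? sentinel form)
              forms
              (recur (conj forms form)))))))))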
Do you need something like this?
(let [exprs (slurp "to_read.clj")]
  ;; wrapping in parens so the whole file reads as one list
  (-> (str "(" exprs ")")
      ;; read-string is potentially harmful: *read-eval* is true by default,
      ;; so #=(...) forms in the string would be evaluated while reading;
      ;; there are non-evaluating readers for Clojure
      (read-string)
      (prn)))
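One such non-evaluating reader is clojure.edn/read-string, which works here as long as the file sticks to edn-readable forms:

(require '[clojure.edn :as edn])
;; clojure.edn never evaluates what it reads
(edn/read-string (str "(" (slurp "to_read.clj") ")"))
;;=> ((def a 2) (def b 3) (defn add-two [x y] (+ x y)) (println (add-two a b)))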
I want to read file entries in a zip file into a sequence of strings if possible. Currently I'm doing something like this to print out directory names for example:
(defn entries [zipfile]
  (lazy-seq
   (if-let [entry (.getNextEntry zipfile)]
     (cons entry (entries zipfile)))))

(defn with-each-entry [fileName f]
  (with-open [z (ZipInputStream. (FileInputStream. fileName))]
    (doseq [e (entries z)]
      ;; (println (.getName e))
      (f e)
      (.closeEntry z))))

(with-each-entry "tmp/my.zip"
  (fn [e] (if (.isDirectory e)
            (println (.getName e)))))
However, this will iterate through the entire zip file. How could I change this so that I could take just the first few entries, something like:
(take 10 (zip-entries "tmp/my.zip"
                      (fn [e] (if (.isDirectory e)
                                (println (.getName e))))))
This seems like a pretty natural fit for the new transducers in CLJ 1.7.
You just build up the transformations you want as a transducer using comp and the usual seq-transforming fns with no seq/collection argument. In your example cases,
(comp (map #(.getName %)) (take 10)) and
(comp (filter #(.isDirectory %)) (map #(-> % .getName println))).
This returns a function of multiple arities which you can use in a lot of ways. In this case you want to eagerly reduce it over the entries sequence (to ensure realization of the entries happens inside with-open), so you use transduce (example zip data made by zipping one of my clojure project folders):
(with-open [z (-> "training-day.zip" FileInputStream. ZipInputStream.)]
  (let [transform (comp (map #(.getName %)) (take 10))]
    (transduce transform conj (entries z))))
;; return value:
;; [".gitignore" ".lein-failures" ".midje-grading-config.clj" ".nrepl-port"
;;  ".travis.yml" "project.clj" "README.md" "target/" "target/classes/"
;;  "target/repl-port"]
Here I'm transducing with base function conj which makes a vector of the names. If you instead want your transducer to perform side-effects and not return a value, you can do that with a base function like (constantly nil):
(with-open [z (-> "training-day.zip" FileInputStream. ZipInputStream.)]
  (let [transform (comp (filter #(.isDirectory %))
                        (map #(-> % .getName println)))]
    (transduce transform (constantly nil) (entries z))))
which gives output:
target/
target/classes/
target/stale/
test/
A potential downside with this is that you'll probably have to manually incorporate .closeEntry calls into each transducer you use here to prevent holding those resources, because you can't in the general case know when each transducer is done reading the entry. One option is to close each entry as soon as you've extracted what you need from it, as sketched below.
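A sketch (not from the original answer; entry-names-xf is a made-up helper):

;; extract what you need from the entry, then close it immediately,
;; so later stages in the transducer never touch a live entry
(defn entry-names-xf [z]
  (map (fn [e]
         (let [n (.getName e)]
           (.closeEntry z) ; positions the stream at the next entry
           n))))

(with-open [z (-> "training-day.zip" FileInputStream. ZipInputStream.)]
  (transduce (comp (entry-names-xf z) (take 10)) conj (entries z)))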
I need to send a string that is the path of a directory to a function. How do I do that in Clojure?
I tried the following, but it didn't work:
(defn make-asm-file [d]
  (doseq [f (.listFiles d)]
    (if (and (= (str (last (split (.getName f) #"\."))) "vm")
             (not (.isDirectory f)))
      (translate f d))))

(make-asm-file "~\SimpleAdd")
How about the following,
(defn find-files
  [regexp directory]
  (filter #(and (.isFile %)
                (re-find regexp (str %)))
          (.listFiles (clojure.java.io/file directory))))

(doseq [f (find-files #"\.vm$" "~/SimpleAdd")]
  (translate f))
In this case (java.io.File. d) would work instead of (clojure.java.io/file d) as well. You could also use file-seq instead of .listFiles, but that would also include *.vm files in subdirectories; see the sketch below.
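For illustration, a recursive variant could look like this (find-files-recursive is a hypothetical name):

;; file-seq walks the whole directory tree, so this also finds
;; *.vm files in subdirectories
(defn find-files-recursive
  [regexp directory]
  (filter #(and (.isFile %)
                (re-find regexp (str %)))
          (file-seq (clojure.java.io/file directory))))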
I'm working through a book on Clojure and ran into a stumbling block with "->>". The author provides an example of a comp that converts camelCased strings into more idiomatic hyphenated keywords. Here's the code using comp:
(require '[clojure.string :as str])

(def camel->keyword (comp keyword
                          str/join
                          (partial interpose \-)
                          (partial map str/lower-case)
                          #(str/split % #"(?<=[a-z])(?=[A-Z])")))
This makes a lot of sense, but I don't really like using partial all over the place to handle a variable number of arguments. Instead, an alternative is provided here:
(defn camel->keyword
  [s]
  (->> (str/split s #"(?<=[a-z])(?=[A-Z])")
       (map str/lower-case)
       (interpose \-)
       str/join
       keyword))
This syntax is much more readable, and mimics the way I would think about solving a problem (front to back, instead of back to front). Extending the comp to complete the aforementioned goal...
(def camel-pairs->map (comp (partial apply hash-map)
                            (partial map-indexed (fn [i x]
                                                   (if (odd? i)
                                                     x
                                                     (camel->keyword x))))))
What would be the equivalent using ->>? I'm not exactly sure how to thread map-indexed (or any iterative function) using ->>. This is wrong:
(defn camel-pairs->map
[s]
(->> (map-indexed (fn [i x]
(if (odd? i)
x
(camel-keyword x)))
(apply hash-map)))
Three problems: missing a parenthesis, missing the > in the name of camel->keyword, and not "seeding" your ->> macro with the initial expression s.
(defn camel-pairs->map [s]
  (->> s
       (map-indexed
        (fn [i x]
          (if (odd? i)
            x
            (camel->keyword x))))
       (apply hash-map)))
Is this really clearer than, say:
(defn camel-pairs->map [s]
  (into {}
        (for [[k v] (partition 2 s)]
          [(camel->keyword k) v])))
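Either way, the behavior is the same. For example:

(camel-pairs->map ["someKey" 1 "anotherLongKey" 2])
;;=> {:some-key 1, :another-long-key 2}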