Read csv into a list in Clojure

I know there are a lot of related questions; I have read them, but I still haven't gained a fundamental understanding of how to read-process-write. Take the following function, for example, which uses the clojure-csv library to parse a line:
(defn take-csv
  "Takes file name and reads data."
  [fname]
  (with-open [file (reader fname)]
    (doseq [line (line-seq file)]
      (let [record (parse-csv line)]))))
What I would like to obtain is the data read into some collection as a result of (def data (take-csv "file.csv")), so I can process it later. So basically my question is: how do I return record, or rather a list of records?

"doseq" is often used for operations with side effect. In your case to create collection of records you can use "map":
(defn take-csv
  "Takes file name and reads data."
  [fname]
  (with-open [file (reader fname)]
    (doall (map (comp first csv/parse-csv) (line-seq file)))))
Better yet, parse the whole file at once to reduce code:
(defn take-csv
  "Takes file name and reads data."
  [fname]
  (with-open [file (reader fname)]
    (csv/parse-csv (slurp file))))
You can also use clojure.data.csv instead of clojure-csv.core; you only need to rename parse-csv to read-csv in the previous function.
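For reference, a minimal sketch of the same thing with clojure.data.csv (the namespace name and aliases here are just for illustration):
(ns example.csv-reader
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]))

(defn take-csv
  "Takes a file name and reads its CSV data into a realized sequence of rows."
  [fname]
  (with-open [file (io/reader fname)]
    ;; doall forces the lazy sequence before with-open closes the reader
    (doall (csv/read-csv file))))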
(defn put-csv [fname table]
  (with-open [file (writer fname)]
    (csv/write-csv file table)))
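Usage might look like this (the file name and rows are just examples, assuming the clojure.data.csv write-csv mentioned above):
;; writes a header row and two data rows to out.csv
(put-csv "out.csv" [["id" "title"] ["1" "Foo"] ["2" "Bar"]])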

With all the things you can do with .csv files, I suggest using clojure-csv or clojure.data.csv. I mostly use clojure-csv to read in a .csv file.
Here are some code snippets from a utility library I use with most of my Clojure programs.
from util.core
(ns util.core
  ^{:author "Charles M. Norton",
    :doc "util is a Clojure utilities directory"}
  (:require [clojure.string :as cstr])
  (:import java.util.Date)
  (:import java.io.File)
  (:use clojure-csv.core))
(defn open-file
  "Attempts to open a file and complains if the file is not present."
  [file-name]
  (let [file-data (try
                    (slurp file-name)
                    (catch Exception e (println (.getMessage e))))]
    file-data))
(defn ret-csv-data
  "Returns a lazy sequence generated by parse-csv.
   Uses open-file which will return a nil, if
   there is an exception in opening fnam.
   parse-csv called on non-nil file, and that
   data is returned."
  [fnam]
  (let [csv-file (open-file fnam)
        inter-csv-data (if-not (nil? csv-file)
                         (parse-csv csv-file)
                         nil)
        csv-data (vec (filter #(and (pos? (count %))
                                    (not (nil? (rest %))))
                              inter-csv-data))]
    (if-not (empty? csv-data)
      (pop csv-data)
      nil)))
(defn fetch-csv-data
  "This function accepts a csv file name, and returns parsed csv data,
   or returns nil if file is not present."
  [csv-file]
  (let [csv-data (ret-csv-data csv-file)]
    csv-data))
Once you've read in a .csv file, then what you do with its contents is another matter. Usually, I am taking .csv "reports" from one financial system, like property assessments, and formatting the data to be uploaded into a database of another financial system, like billing.
I will often either zipmap each .csv row so I can extract data by column name (having read in the column names), or even make a sequence of zipmap'ped .csv rows.
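As a sketch of that zipmap approach (the column names here are made up; the rows are as returned by parse-csv):
(defn rows->maps
  "Turns a header row plus data rows into a sequence of maps keyed by column name."
  [[header & rows]]
  (let [ks (map keyword header)]
    (map #(zipmap ks %) rows)))

;; e.g. (rows->maps [["id" "name"] ["1" "Foo"] ["2" "Bar"]])
;;      => ({:id "1", :name "Foo"} {:id "2", :name "Bar"})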

Just to add to these good answers, here is a full example.
First, add clojure-csv to your dependencies.
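In a Leiningen project.clj that would be something like this (the exact version is illustrative; check Clojars for the current release):
[clojure-csv "2.0.2"]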
(ns scripts.csvreader
  (:require [clojure-csv.core :as csv]
            [clojure.java.io :as io]))

(defn take-csv
  "Takes file name and reads data."
  [fname]
  (with-open [file (io/reader fname)]
    (-> file
        (slurp)
        (csv/parse-csv))))
usage
(take-csv "/path/youfile.csv")

Related

Using proper functional style in a file processing task

I have an input csv file and need to generate an output file that has one line for each input line. Each input line could be of a specific type (say "old" or "new") that can be determined only by processing the input line.
In addition to generating the output file, we also want to print the summary of how many lines of each type were in the input file. My actual task involves generating different SQLs based on the input line type, but to keep the example code focussed, I have kept the processing in the function proc-line simple. The function func determines what type an input line is -- again, I have kept it simple by randomly generating a type. The actual logic is more involved.
I have the following code and it does the job. However, to retain a functional style for the task of generating the summary, I chose to return a keyword to signify the type of each line and created a lazy sequence of these for generating the final summary. In an imperative style, we would simply increment a count for each line type. Generating a potentially large collection just for summarizing seems inefficient. Another consequence of the way I have coded it is the repetition of the (.write writer ...) portion. Ideally, I would code that just once.
Any suggestions for eliminating the two problems I have identified (and others)?
(ns file-proc.core
  (:gen-class)
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]))

(defn func [x]
  (rand-nth [true false]))

(defn proc-line [line writer]
  (if (func line)
    (do (.write writer (str line "\n")) :new)
    (do (.write writer (str (reverse line) "\n")) :old)))

(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (->> (csv/read-csv reader)
         (rest)
         (map #(proc-line % writer))
         (frequencies)
         (doall))))
I'd try to separate data processing from side effects like reading/writing files. Hopefully this keeps the IO operations at opposite boundaries of the pipeline, while the "middle" processing logic stays agnostic of where the input comes from and where the output is going.
(defn rand-bool [] (rand-nth [true false]))

(defn proc-line [line]
  (if (rand-bool)
    [line :new]
    [(reverse line) :old]))
proc-line no longer takes a writer; it only cares about the line, and it returns a vector/2-tuple of the processed line along with a keyword. It doesn't concern itself with string formatting either; we should let csv/write-csv do that. Now you could do something like this:
(defn process-lines [reader]
  (->> (csv/read-csv reader)
       (rest)
       (map proc-line)))

(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (let [lines (process-lines reader)]
      (csv/write-csv writer (map first lines))
      (frequencies (map second lines)))))
This will work but it's going to realize/keep the entire input sequence in memory, which you don't want for large files. We need a way to keep this pipeline lazy/efficient, but we also have to produce two "streams" from one and in a single pass: the processed lines only to be sent to write-csv, and each line's metadata for calculating frequencies. One "easy" way to do this is to introduce some mutability to track the metadata frequencies as the lazy sequence is consumed by write-csv:
(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (let [freqs (atom {})]
      (->> (csv/read-csv reader)
           ;; processing starts
           (rest)
           (map (fn [line]
                  (let [[row tag] (proc-line line)]
                    (swap! freqs update tag (fnil inc 0))
                    row)))
           ;; processing ends
           (csv/write-csv writer))
      @freqs)))
I removed the process-lines call to make the full pipeline more apparent. By the time write-csv has fully (and lazily) consumed its payload, freqs will be a map like {:old 23, :new 31} which will be the return value of generate-report. There's room for improvement/generalization, but I think this is a start.
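For instance, with placeholder file names, calling it would return the tallies directly:
;; file names are placeholders; the exact counts depend on the input
(generate-report "input.csv" "output.csv")
;;=> {:old 23, :new 31}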
As others have mentioned, separating writing and processing work would be ideal. Here's how I usually do this:
(defn product-type [p]
  (rand-nth [:new :old]))

(defn row->product [row]
  (let [p (zipmap [:id :name :price] row)]
    (assoc p :type (product-type p))))

(defmulti to-csv :type)
(defmethod to-csv :new [product] ...)
(defmethod to-csv :old [product] ...)

(defn generate-report [from to]
  (with-open [rdr (io/reader from)
              wrtr (io/writer to)]
    (->> (rest (csv/read-csv rdr))
         (map row->product)
         (map #(do (.write wrtr (to-csv %)) %))
         (map :type)
         (frequencies)
         (doall))))
(The code might not work; I didn't run it, sorry.)
Constructing a hash-map and using multimethods is optional, of course, but it's better to assign a product its type first. This way its data dictates what the pipeline is doing, not proc-line.
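For completeness, the elided to-csv methods could look something like this (purely hypothetical formatting, since the question doesn't specify the output):
(defmethod to-csv :new [product]
  ;; hypothetical: write new products as id,name,price
  (str (:id product) "," (:name product) "," (:price product) "\n"))

(defmethod to-csv :old [product]
  ;; hypothetical: write old products with the columns reversed
  (str (:price product) "," (:name product) "," (:id product) "\n"))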
To refactor the code we need the safety net of at least one characterization test for generate-report. Since that function does file I/O (we will make the code independent from I/O later), we will use this sample CSV file, f1.csv:
Year,Code
1997,A
2000,B
2010,C
1996,D
2001,E
We cannot yet write a test because function func uses a RNG, so we rewrite it to be deterministic by actually looking at the input. While there, we rename it to new?, which is more representative of the problem:
(defn new? [row]
  (>= (Integer/parseInt (first row)) 2000))
where, for the sake of the exercise, we assume that a row is "new" if the Year column is >= 2000.
We can now write the test and see it pass (here for brevity we focus only on the frequency calculation, not on the output transformation):
(deftest characterization-as-posted
  (is (= {:old 2, :new 3}
         (generate-report "f1.csv" "f1.tmp"))))
And now to the refactoring. The main idea is to realize that we need an accumulator, so we replace map with reduce and get rid of frequencies and of doall. Also, we rename "line" to "row", since that is what a line is called in the CSV format:
(defn generate-report [from to]                       ; 1
  (let [[old new _]                                   ; 2
        (with-open [reader (io/reader from)           ; 3
                    writer (io/writer to)]            ; 4
          (->> (csv/read-csv reader)                  ; 5
               (rest)                                 ; 6
               (reduce process-row [0 0 writer])))]   ; 7
    {:old old :new new}))                             ; 8
The new process-row (originally process-line) becomes:
(defn process-row [[old new writer] row]
  (if (new? row)
    (do (.write writer (str row "\n")) [old (inc new) writer])
    (do (.write writer (str (reverse row) "\n")) [(inc old) new writer])))
Function process-row, like any function passed to reduce, takes two arguments: the first, [old new writer], is a vector holding the two accumulators and the I/O writer (the vector is destructured); the second, row, is one element of the collection being reduced. It returns the new vector of accumulators, which at the end of the collection is destructured in line 2 of generate-report and used at line 8 to create a hash map equivalent to the one previously returned by frequencies.
We can do one last refactoring: separate the file I/O from the business logic, so that we can write tests without the scaffolding of prepared input files, as follows.
Function process-row becomes:
(defn process-row [[old-cnt new-cnt writer] row]
  (let [[out-row old new] (process-row-pure old-cnt new-cnt row)]
    (do (.write writer out-row)
        [old new writer])))
and the business logic can be done by the pure (and so easily testable) function:
(defn process-row-pure [old new row]
  (if (new? row)
    [(str row "\n") old (inc new)]
    [(str (reverse row) "\n") (inc old) new]))
All this without mutating anything.
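For example, a test exercising just the counting logic might look like this (a sketch assuming clojure.test; it deliberately ignores the formatted output string):
(deftest process-row-pure-counts
  ;; a "new" row (Year >= 2000) increments the new counter
  (let [[_ old new] (process-row-pure 0 0 ["2010" "C"])]
    (is (= [0 1] [old new])))
  ;; an "old" row (Year < 2000) increments the old counter
  (let [[_ old new] (process-row-pure 0 0 ["1996" "D"])]
    (is (= [1 0] [old new]))))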
IMHO, I would separate the two different aspects: counting the frequencies and writing to a file:
(defn count-lines
  ([lines] (count-lines lines 0 0))
  ([lines count-old count-new]
   (if-let [line (first lines)]
     (if (func line)
       (recur (rest lines) count-old (inc count-new))
       (recur (rest lines) (inc count-old) count-new))
     {:new count-new :old count-old})))
(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (let [lines (rest (csv/read-csv reader))
          frequencies (count-lines lines)]
      (doseq [line lines]
        (.write writer (str line "\n")))
      frequencies)))  ; return the counts map

How to download a file and unzip it from memory in clojure?

I'm making a GET request using clj-http and the response is a zip file. The contents of this zip are always a single CSV file. I want to save the CSV file to disk, but I can't figure out how.
If I have the file on disk, (fs/unzip filename destination) from the Raynes/fs library works great, but I can't figure out how to coerce the response from clj-http into something it can read. If possible, I'd like to unzip the file directly without writing the zip to disk first.
The closest I've gotten (if this is even close) gets me to a BufferedInputStream, but I'm lost from there.
(require '[clj-http.client :as client])
(require '[clojure.java.io :as io])

(-> (client/get "http://localhost:8000/blah.zip" {:as :byte-array})
    (:body)
    (io/input-stream))
You can use the pure Java java.util.zip.ZipInputStream or java.util.zip.GZIPInputStream, depending on how the content is zipped. This is the code that saves your file using java.util.zip.GZIPInputStream:
(-> (client/get "http://localhost:8000/blah.zip" {:as :byte-array})
    (:body)
    (io/input-stream)
    (java.util.zip.GZIPInputStream.)
    (clojure.java.io/copy (clojure.java.io/file "/path/to/output/file")))
Using java.util.zip.ZipInputStream makes it only a bit more complicated:
(let [stream (-> (client/get "http://localhost:8000/blah.zip" {:as :byte-array})
                 (:body)
                 (io/input-stream)
                 (java.util.zip.ZipInputStream.))]
  (.getNextEntry stream)
  (clojure.java.io/copy stream (clojure.java.io/file "/path/to/output/file")))
(require '[clj-http.client :as httpc])
(import '[java.io File])

(defn download-unzip [url dir]
  (let [saveDir (File. dir)]
    (with-open [stream (-> (httpc/get url {:as :stream})
                           (:body)
                           (java.util.zip.ZipInputStream.))]
      (loop [entry (.getNextEntry stream)]
        (if entry
          (let [savePath (str dir File/separatorChar (.getName entry))
                saveFile (File. savePath)]
            (if (.isDirectory entry)
              (if-not (.exists saveFile)
                (.mkdirs saveFile))
              (let [parentDir (File. (.substring savePath 0 (.lastIndexOf savePath (int File/separatorChar))))]
                (if-not (.exists parentDir) (.mkdirs parentDir))
                (clojure.java.io/copy stream saveFile)))
            (recur (.getNextEntry stream))))))))
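Usage would be something like this (the URL is the one from the question; the target directory is a placeholder):
(download-unzip "http://localhost:8000/blah.zip" "/tmp/extracted")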

Clojure - Speed up large file processing

I need to read a large file (~1GB), process it, and save it to a DB. My solution looks like this:
data.txt
format: [id],[title]\n
1,Foo
2,Bar
...
code
(ns test.core
  (:require [clojure.java.io :as io]
            [clojure.string :refer [split]]))

(defn parse-line
  [line]
  (let [values (split line #",")]
    (zipmap [:id :title] values)))

(defn run
  []
  (with-open [reader (io/reader "~/data.txt")]
    (insert-batch (map parse-line (line-seq reader)))))
; insert-batch just saves a vector of records into the database
But this code does not work well, because it first parses all the lines and then sends them to the database.
I think the ideal solution would be: read a line -> parse the line -> collect 1000 parsed lines -> batch insert them into the database -> repeat until there are no more lines. Unfortunately, I have no idea how to implement this.
One suggestion:
Use line-seq to get a lazy sequence of lines,
use map to parse each line,
(so far this matches what you are doing)
use partition-all to partition your lazy sequence of parsed lines into batches, and then
use insert-batch with doseq to write each batch to the database.
And an example:
(->> (line-seq reader)
     (map parse-line)
     (partition-all 1000)
     (#(doseq [batch %]
         (insert-batch batch))))
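Putting the suggestion together in the run function from the question might look like this (a sketch; parse-line and insert-batch are assumed to exist as in the question, and the path is the question's placeholder):
(defn run []
  (with-open [reader (io/reader "~/data.txt")]
    (doseq [batch (->> (line-seq reader)
                       (map parse-line)
                       (partition-all 1000))]
      ;; write each batch of up to 1000 parsed records to the database
      (insert-batch batch))))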

read tab delimited file in clojure

How do I read a tab-delimited file using Clojure? There may be whitespace in a line that does not correspond to a tab.
E.g.: transform
some field another-field a third field
into
["some field" "another-field" "a third field"]
You can use the data.csv Contrib library:
;; in your :dependencies
[org.clojure/data.csv "0.1.2"]
;; at the REPL
(require '[clojure.data.csv :as csv])
(csv/read-csv
 (java.io.StringReader. "some field\tanother-field\ta third field")
 :separator \tab)
;= (["some field" "another-field" "a third field"])
(Use something like (with-open [rdr (clojure.java.io/reader f)] (vec (csv/read-csv rdr :separator \tab))) to read data from the TSV file f.)
If you don't want to do it by hand you could use a CSV library, e.g.:
https://github.com/clojure/data.csv
https://github.com/davidsantiago/clojure-csv
Then you'd be on the safe side if your requirements change (e.g. you want to allow spaces in values, the delimiter changes, you want quoting, ...) since you could easily adapt. However, directly splitting single lines works, too:
(require '[clojure.java.io :as io]
         '[clojure.string :as string])

(with-open [rd (io/reader (io/file "/path/to/file"))]
  (->> (line-seq rd)
       (map #(.split ^String % "\t"))
       (mapv vec)))
Still, I'd go with a library if I were you.

How can I deserialize a record structure from a file, already saved to the file with print-dup?

I have the following code:
(use 'clojure.java.io)

(defrecord Member [id name salary role])
(defrecord Role [id name])

(def member-records (ref ()))

(defn add-member [member]
  (dosync (alter member-records conj member)))

;;Test-data -->
(def dev-r (->Role 1 "Developer"))
(def test-member1 (->Member 1 "Kirill" 70000.00 dev-r))
;;Test-data <--

(defn save-data-2-file []
  (with-open [wrtr (writer "C:/Platform/Work/test.cdf")]
    (print-dup @member-records wrtr)))

(defn process-line [line]
  (println line))

;;Test line content
;;#BTC.pcost.Member{:id 1, :name "Kirill", :salary 70000.0, :role #BTC.pcost.Role{:id 1, :name "Developer"}})

(defn load-data-from-file []
  (with-open [rdr (reader "C:/Platform/Work/test.cdf")]
    (doseq [line (line-seq rdr)]
      (process-line line))))
I want to recreate the records after reading the file, but I can't understand how to do it. Yes, I know that I can parse the text and fill my structures from the elements of each parsed line, but that would be difficult, because I have a lot of structs like "Member" and "Role". Can anyone suggest a way I can do this?
You can use read-string, and slurp, to pull the records out of the file. read-string is limited to reading the first form of a string, but, from your sample, you are only storing a single form, as a list of records.
(defn load-data-from-file [file]
  (read-string (slurp file)))
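For example, with the file from the original question:
(load-data-from-file "C:/Platform/Work/test.cdf")
;;=> the list of Member records that was written by save-data-2-file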
Lazy Reading
If you need more than the first form, or cannot read the entire stream into memory, you can use read directly, to make a lazy reader.
(defn lazy-read
  ([rdr] (let [eof (Object.)] (lazy-read rdr (read rdr false eof) eof)))
  ([rdr data eof]
   (if (not= eof data)
     (cons data (lazy-seq (lazy-read rdr (read rdr false eof) eof))))))

(defn load-all-data [file]
  (with-open [rdr (java.io.PushbackReader. (reader file))]
    (doall (lazy-read rdr))))

(load-all-data "C:/Platform/Work/test.cdf")
Security
Also, it is good to mention security when loading code with read-string or read. You should only use them with trusted sources, because, using #= or a Java constructor, the source can execute arbitrary code inside your application. For a longer explanation, take a look at the documentation for read.
Setting *read-eval* to false would prevent the issue, but it would also prevent the reconstruction of the records in your sample. To avoid the issue altogether, you can use the clojure.edn/read and clojure.edn/read-string functions, with a whitelist of readers.
(defn edn-read [eof rdr]
  (clojure.edn/read {:eof eof :readers {'BTC.pcost.Role map->Role
                                        'BTC.pcost.Member map->Member}}
                    rdr))

(defn lazy-edn-read
  ([rdr] (let [eof (Object.)] (lazy-edn-read rdr (edn-read eof rdr) eof)))
  ([rdr data eof]
   (if (not= eof data)
     (cons data (lazy-seq (lazy-edn-read rdr (edn-read eof rdr) eof))))))

(defn load-all-data [file]
  (with-open [rdr (java.io.PushbackReader. (reader file))]
    (doall (take-while (complement nil?) (lazy-edn-read rdr)))))

(load-all-data "C:/Platform/Work/test.cdf")
You can use read.
This function will read one object from a file:
(defn load-data-from-file [filename]
  (with-open [rdr (java.io.PushbackReader. (reader filename))]
    (read rdr)))
Or this will read all objects from the file:
(defn load-all-data-from-file [filename]
  (let [eof (Object.)]
    (with-open [rdr (java.io.PushbackReader. (reader filename))]
      (doall
       (take-while #(not= % eof)
                   (repeatedly #(read rdr nil eof)))))))
Here's the API documentation for read.
This is a small variation that will read all objects from a string:
(defn load-all-data-from-string [string]
  (let [eof (Object.)]
    (with-open [rdr (-> string java.io.StringReader. java.io.PushbackReader.)]
      (doall
       (take-while #(not= % eof)
                   (repeatedly #(read rdr nil eof)))))))
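For example (the input string here is just an illustration):
(load-all-data-from-string "{:a 1} {:b 2} [3 4]")
;;=> ({:a 1} {:b 2} [3 4])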
This is, as far as I know, not possible to do using read-string. Instead we use read with a java.io.StringReader.