I have an input csv file and need to generate an output file that has one line for each input line. Each input line could be of a specific type (say "old" or "new") that can be determined only by processing the input line.
In addition to generating the output file, we also want to print the summary of how many lines of each type were in the input file. My actual task involves generating different SQLs based on the input line type, but to keep the example code focussed, I have kept the processing in the function proc-line simple. The function func determines what type an input line is -- again, I have kept it simple by randomly generating a type. The actual logic is more involved.
I have the following code and it does the job. However, to retain a functional style for the task of generating the summary, I chose to return a keyword to signify the type of each line and created a lazy sequence of these for generating the final summary. In an imperative style, we would simply increment a count for each line type. Generating a potentially large collection just for summarizing seems inefficient. Another consequence of the way I have coded it is the repetition of the (.write writer ...) portion. Ideally, I would code that just once.
Any suggestions for eliminating the two problems I have identified (and others)?
(ns file-proc.core
(:gen-class)
(:require [clojure.data.csv :as csv]
[clojure.java.io :as io]))
(defn func [x]
(rand-nth [true false]))
(defn proc-line [line writer]
(if (func line)
(do (.write writer (str line "\n")) :new)
(do (.write writer (str (reverse line) "\n")) :old)))
(defn generate-report [from to]
(with-open
[reader (io/reader from)
writer (io/writer to)]
(->> (csv/read-csv reader)
(rest)
(map #(proc-line % writer))
(frequencies)
(doall))))
I'd try to separate data processing from side effects like reading/writing files. Hopefully this allows the IO operations to stay at opposite boundaries of the pipeline, while the "middle" processing logic remains agnostic of where the input comes from and where the output is going.
(defn rand-bool [] (rand-nth [true false]))
(defn proc-line [line]
(if (rand-bool)
[line :new]
[(reverse line) :old]))
proc-line no longer takes a writer, it only cares about the line and it returns a vector/2-tuple of the processed line along with a keyword. It doesn't concern itself with string formatting either—we should let csv/write-csv do that. Now you could do something like this:
(defn process-lines [reader]
(->> (csv/read-csv reader)
(rest)
(map proc-line)))
(defn generate-report [from to]
(with-open [reader (io/reader from)
writer (io/writer to)]
(let [lines (process-lines reader)]
(csv/write-csv writer (map first lines))
(frequencies (map second lines)))))
This will work but it's going to realize/keep the entire input sequence in memory, which you don't want for large files. We need a way to keep this pipeline lazy/efficient, but we also have to produce two "streams" from one and in a single pass: the processed lines only to be sent to write-csv, and each line's metadata for calculating frequencies. One "easy" way to do this is to introduce some mutability to track the metadata frequencies as the lazy sequence is consumed by write-csv:
(defn generate-report [from to]
(with-open [reader (io/reader from)
writer (io/writer to)]
(let [freqs (atom {})]
(->> (csv/read-csv reader)
;; processing starts
(rest)
(map (fn [line]
(let [[row tag] (proc-line line)]
(swap! freqs update tag (fnil inc 0))
row)))
;; processing ends
(csv/write-csv writer))
@freqs)))
I removed the process-lines call to make the full pipeline more apparent. By the time write-csv has fully (and lazily) consumed its payload, freqs will be a map like {:old 23, :new 31} which will be the return value of generate-report. There's room for improvement/generalization, but I think this is a start.
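For example (the file names here are just placeholders), a call looks like:
(generate-report "input.csv" "output.csv")
;;=> {:old 23, :new 31}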
As others have mentioned, separating writing and processing work would be ideal. Here's how I usually do this:
(defn product-type [p]
(rand-nth [:new :old]))
(defn row->product [row]
(let [p (zipmap [:id :name :price] row)]
(assoc p :type (product-type p))))
(defmulti to-csv :type)
(defmethod to-csv :new [product] ...)
(defmethod to-csv :old [product] ...)
(defn generate-report [from to]
(with-open [rdr (io/reader from)
wrtr (io/writer to)]
(->> (rest (csv/read-csv rdr))
(map row->product)
(map #(do (.write wrtr (to-csv %)) %))
(map :type)
(frequencies)
(doall))))
(The code might not work—didn't run it, sorry.)
Constructing a hash-map and using multimethods is optional, of course, but it's better to assign a product its type first. This way the data dictates what the pipeline is doing, not proc-line.
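For illustration only (the actual bodies depend on the SQL/CSV you need to emit, so these are hypothetical), the two methods could be as simple as:
(defmethod to-csv :new [product]
  (str (:id product) "," (:name product) "," (:price product) "\n"))

(defmethod to-csv :old [product]
  (str (:price product) "," (:name product) "," (:id product) "\n"))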
To refactor the code we need the safety net of at least one characterization test for generate-report. Since that function does file I/O (we will make the code independent from I/O later), we will use this sample CSV file, f1.csv:
Year,Code
1997,A
2000,B
2010,C
1996,D
2001,E
We cannot yet write a test because function func uses an RNG, so we rewrite it to be deterministic by actually looking at the input. While we are at it, we rename it to new?, which is more representative of the problem:
(defn new? [row]
(>= (Integer/parseInt (first row)) 2000))
where, for the sake of the exercise, we assume that a row is "new" if the Year column is >= 2000.
We can now write the test and see it pass (here for brevity we focus only on the frequency calculation, not on the output transformation):
(deftest characterization-as-posted
(is (= {:old 2, :new 3}
(generate-report "f1.csv" "f1.tmp"))))
And now to the refactoring. The main idea is to realize that we need an accumulator, replacing map with reduce and getting rid of frequencies and of doall. Also, we rename "line" to "row", since that is what a line is called in the CSV format:
(defn generate-report [from to] ; 1
(let [[old new _] ; 2
(with-open [reader (io/reader from) ; 3
writer (io/writer to)] ; 4
(->> (csv/read-csv reader) ; 5
(rest) ; 6
(reduce process-row [0 0 writer])))] ; 7
{:old old :new new})) ; 8
The new process-row (originally process-line) becomes:
(defn process-row [[old new writer] row]
(if (new? row)
(do (.write writer (str row "\n")) [old (inc new) writer])
(do (.write writer (str (reverse row) "\n")) [(inc old) new writer])))
Like any function passed to reduce, process-row has two arguments: the first, [old new writer], is a vector of the two accumulators and the I/O writer (the vector is destructured); the second, row, is one element of the collection being reduced. It returns the new vector of accumulators, which at the end of the collection is destructured in line 2 of generate-report and used at line 8 to create a hashmap equivalent to the one previously returned by frequencies.
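As a toy REPL illustration of this accumulator shape (unrelated to the CSV code itself):
(reduce (fn [[evens odds] x]
          (if (even? x)
            [(inc evens) odds]
            [evens (inc odds)]))
        [0 0]
        [1 2 3 4 5])
;;=> [2 3]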
We can do one last refactoring: separate the file I/O from the business logic, so that we can write tests without the scaffolding of prepared input files, as follows.
Function process-row becomes:
(defn process-row [[old-cnt new-cnt writer] row]
(let [[out-row old new] (process-row-pure old-cnt new-cnt row)]
(do (.write writer out-row)
[old new writer])))
and the business logic can be done by the pure (and so easily testable) function:
(defn process-row-pure [old new row]
(if (new? row)
[(str row "\n") old (inc new)]
[(str (reverse row) "\n") (inc old) new]))
All this without mutating anything.
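For example, the pure function can now be unit-tested directly, with no files involved (a sketch in the same style as the characterization test above, again focusing only on the counters):
(deftest process-row-pure-test
  (is (= [1 0] (rest (process-row-pure 0 0 ["1997" "A"]))))   ; an "old" row bumps the old counter
  (is (= [0 1] (rest (process-row-pure 0 0 ["2010" "C"])))))  ; a "new" row bumps the new counter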
IMHO, I would separate the two different aspects: counting the frequencies and writing to a file:
(defn count-lines
  ([lines] (count-lines lines 0 0))
  ([lines count-old count-new]
   (if-let [line (first lines)]
     (if (func line)
       (recur (rest lines) count-old (inc count-new))
       (recur (rest lines) (inc count-old) count-new))
     {:new count-new :old count-old})))
(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (let [lines (rest (csv/read-csv reader))
          freqs (count-lines lines)]
      (doseq [line lines]
        (.write writer (str line "\n")))
      freqs)))
I'm trying to figure out what is wrong with my code here. Basically the idea behind it is that I am reading a very large file and at the end of each line in the file is a number. I want to count the number of lines that have the number at the end greater than 500.
What I have is this, and on paper it should work, but something is going wrong and it keeps returning nil.
(defn countlines []
  (with-open [rdr (clojure.java.io/reader "myfile.txt")]
    (doseq [line (line-seq rdr)]
      (count (re-find #"(?!500)[56789]\d{2,}|\d{4,}$" line)))))
the reason is that you use doseq:
clojure.core/doseq
[seq-exprs & body]
Macro
Added in 1.0
Repeatedly executes body (presumably for side-effects) with
bindings and filtering as provided by "for". Does not retain
the head of the sequence. Returns nil.
you should probably rewrite it to something like (doall (for [line (line-seq rdr)] ...
but to fulfill your task you need to rewrite it anyway, because your function would then return a seq of counts of characters in the matches:
user> (count (re-find #"\d+" "123k456"))
3
which is obviously not what you want
what you need to do is:
(count (filter #(re-find #"(?!500)[56789]\d{2,}|\d{4,}$" %)
(line-seq rdr)))
If I understand the question correctly, you should be doing something like this:
(defn countlines [] (with-open [rdr (clojure.java.io/reader "myfile.txt")]
(-> (line-seq rdr)
(filter #(re-find #"(?!500)[56789]\d{2,}|\d{4,}$" %))
(count))))
About Martin Lechner's answer: I think it should use thread-last (->>) rather than thread-first (->). So it should be
(defn countlines []
  (with-open [rdr (clojure.java.io/reader "myfile.txt")]
    (->> (line-seq rdr)
         (filter #(re-find #"(?!500)[56789]\d{2,}|\d{4,}$" %))
         (count))))
I want to parse big log files using Clojure.
The structure of each line record is "UserID,Latitude,Longitude,Timestamp".
My implemented steps are:
----> Read log file & Get top-n user list
----> Find each top-n user's records and store them in a separate log file (UserID.log).
The implemented source code:
;======================================================
(defn parse-file
""
[file n]
(with-open [rdr (io/reader file)]
(println "001 begin with open ")
(let [lines (line-seq rdr)
res (parse-recur lines)
sorted
(into (sorted-map-by (fn [key1 key2]
(compare [(get res key2) key2]
[(get res key1) key1])))
res)]
(println "Statistic result : " res)
(println "Top-N User List : " sorted)
(find-write-recur lines sorted n)
)))
(defn parse-recur
""
[lines]
(loop [ls lines
res {}]
(if ls
(recur (next ls)
(update-res res (first ls)))
res)))
(defn update-res
""
[res line]
(let [params (string/split line #",")
id (if (> (count params) 1) (params 0) "0")]
(if (res id)
(update-in res [id] inc)
(assoc res id 1))))
(defn find-write-recur
"Get each users' records and store into separate log file"
[lines sorted n]
(loop [x n
sd sorted
id (first (keys sd))]
(if (and (> x 0) sd)
(do (create-write-file id
(find-recur lines id))
(recur (dec x)
(rest sd)
(nth (keys sd) 1))))))
(defn find-recur
""
[lines id]
(loop [ls lines
res []]
(if ls
(recur (next ls)
(update-vec res id (first ls)))
res)))
(defn update-vec
""
[res id line]
(let [params (string/split line #",")
id_ (if (> (count params) 1) (params 0) "0")]
(if (= id id_ )
(conj res line)
res)))
(defn create-write-file
"Create a new file and write information into the file."
([file info-lines]
(with-open [wr (io/writer (str MAIN-PATH file))]
(doseq [line info-lines] (.write wr (str line "\n")))
))
([file info-lines append?]
(with-open [wr (io/writer (str MAIN-PATH file) :append append?)]
(doseq [line info-lines] (.write wr (str line "\n"))))
))
;======================================================
I tested this clj in the REPL with the command (parse-file "./DATA/log.log" 3) and got these results:
Records-----Size-----Time----Result
1,000-------42KB-----<1s-----OK
10,000------420KB----<1s-----OK
100,000-----4.3MB----3s------OK
1,000,000---43MB-----15s-----OK
6,000,000---258MB---->20M----"OutOfMemoryError Java heap space java.lang.String.substring (String.java:1913)"
======================================================
Here is the question:
1. How can I fix the error when I try to parse a big log file, like > 200MB?
2. How can I optimize the function to run faster?
3. There are logs of more than 1G in size; how can the function deal with those?
I am still new to Clojure; any suggestion or solution will be appreciated.
Thanks
As a direct answer to your questions, from a little Clojure experience:
The quick and dirty fix for running out of memory boils down to giving the JVM more memory. You can try adding this to your project.clj:
:jvm-opts ["-Xmx1G"] ;; or more
That will make Leiningen launch the JVM with a higher memory cap.
This kind of work is going to use a lot of memory no matter how you work it. @Vidya's suggestion to use a library is definitely worth considering. However, there's one optimization that you can make that should help a little.
Whenever you're dealing with your (line-seq ...) object (a lazy sequence) you should make sure to maintain it as a lazy seq. Calling next on it realizes the following element eagerly; use rest instead to stay as lazy as possible. Take a look at the clojure site, especially the section on laziness:
(rest aseq) - returns a possibly empty seq, never nil
[snip]
a (possibly) delayed path to the remaining items, if any
You may even want to traverse the log twice--once to pull just the user IDs from each line as a lazy seq, and again to pull out the lines for those users. This will minimize the amount of the file you're holding onto at any one time.
Making sure your function is lazy should reduce the sheer overhead that having the file as a sequence in memory creates. Whether that's enough to parse a 1G file, I don't think I can say.
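A rough sketch of the two-pass idea mentioned above (illustrative only; it assumes the comma-separated format and the io/string aliases already used in the question): one pass counts user IDs lazily, then a pass per selected user collects that user's lines, with each pass opening and closing its own reader so nothing retains the head of the sequence.
(defn top-n-users [file n]
  (with-open [rdr (io/reader file)]
    (->> (line-seq rdr)
         (map #(first (string/split % #",")))  ; the user id is the first field
         frequencies
         (sort-by val >)
         (take n)
         (mapv key))))                          ; mapv realizes the result before the reader closes

(defn user-lines [file id]
  (with-open [rdr (io/reader file)]
    (->> (line-seq rdr)
         (filter #(= id (first (string/split % #","))))
         vec)))                                 ; again, realize before the reader closes
The vectors returned by user-lines can then be handed to the existing create-write-file.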
You definitely don't need Cascalog or Hadoop simply to parse a file which doesn't fit into your Java heap. This SO question provides some working examples of how to process large files lazily. The main point is you need to keep the file open while you traverse the lazy seq. Here is what worked for me in a similar situation:
(defn lazy-file-lines [file]
(letfn [(helper [rdr]
(lazy-seq
(if-let [line (.readLine rdr)]
(cons line (helper rdr))
(do (.close rdr) nil))))]
(helper (clojure.java.io/reader file))))
You can map, reduce, count, etc. over this lazy sequence:
(count (lazy-file-lines "/tmp/massive-file.txt"))
;=> <a large integer>
The parsing is a separate, simpler problem.
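For instance (a sketch only; the key names are illustrative and it assumes the comma-separated layout from the question), each line can be turned into a map and the per-user counts computed in one lazy pass:
(require '[clojure.string :as str])

(defn parse-line [line]
  (zipmap [:user-id :lat :lon :timestamp]
          (str/split line #",")))

;; user-id frequencies over the whole file, one lazy pass:
(->> (lazy-file-lines "/tmp/massive-file.txt")
     (map (comp :user-id parse-line))
     frequencies)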
I am also relatively new to Clojure, so there are no obvious optimizations I can see. Hopefully others more experienced can offer some advice. But I feel like this is simply a matter of the data size being too big for the tools at hand.
For that reason, I would suggest using Cascalog, an abstraction over Hadoop or your local machine using Clojure. I think the syntax for querying big log files would be pretty straightforward for you.
I'm trying to randomly sample a large FASTQ file and write it to standard out. I keep getting 'GC overhead limit exceeded' errors and I'm not sure what I'm doing wrong. I've tried increasing Xmx in leiningen but that didn't help. Here is my code:
(ns fastq-sample.core
(:gen-class)
(:use clojure.java.io))
(def n-read-pair-lines 8)
(defn sample? [sample-rate]
(> sample-rate (rand)))
;
; Agent for writing the reads asynchronously
;
(def wtr (agent (writer *out*)))
(defn write-out [r]
(letfn [(write [out msg] (.write out msg) out)]
(send wtr write r)))
(defn write-close []
(send wtr #(.close %))
(await wtr))
;
; Main
;
(defn reads [file]
(->>
(input-stream file)
(java.util.zip.GZIPInputStream.)
(reader)
(line-seq)))
(defn -main [fastq-file sample-rate-str]
(let [sample-rate (Float. sample-rate-str)
in-reads (partition n-read-pair-lines (reads fastq-file))]
(doseq [x (filter (fn [_] (sample? sample-rate)) in-reads)]
(write-out (clojure.string/join "\n" x)))
(write-close)
(shutdown-agents)))
This is the same symptom I often get when I try to merge an infinite sequence into a single data structure like a map or vector. It very often means that memory was tight and the garbage collector could not keep up with the demand for new objects. Most likely the state held by the wtr agent is too large for memory. Perhaps you don't want to keep the printed results in the agent's state; try changing
(write [out msg] (.write out msg) out)
to
(write [out msg] (.write out msg))
I have the following code:
(use 'clojure.java.io)
(defrecord Member [id name salary role])
(defrecord Role [id name])
(def member-records (ref ()))
(defn add-member [member]
(dosync (alter member-records conj member)))
;;Test-data -->
(def dev-r(->Role 1 "Developer"))
(def test-member1(->Member 1 "Kirill" 70000.00 dev-r))
;;Test-data <--
(defn save-data-2-file []
(with-open [wrtr (writer "C:/Platform/Work/test.cdf")]
(print-dup @member-records wrtr)))
(defn process-line [line]
(println line))
;;Test line content
;;#BTC.pcost.Member{:id 1, :name "Kirill", :salary 70000.0, :role #BTC.pcost.Role{:id 1, :name "Developer"}})
(defn load-data-from-file []
(with-open [rdr (reader "C:/Platform/Work/test.cdf")]
(doseq [line (line-seq rdr)]
(process-line line))))
I want to recreate the records after reading the file, but I cannot work out how to do it. Yes, I know that I could parse the text and fill my structures from the elements of each parsed line, but that would be difficult, because I have a lot of structs like "Member" and "Role". Can anyone suggest a way to do this?
You can use read-string and slurp to pull the records out of the file. read-string is limited to reading the first form of a string, but from your sample you are only storing a single form, as a list of records.
(defn load-data-from-file [file]
(read-string (slurp file)))
Lazy Reading
If you need more than the first form, or cannot read the entire stream into memory, you can use read directly, to make a lazy reader.
(defn lazy-read
([rdr] (let [eof (Object.)] (lazy-read rdr (read rdr false eof) eof)))
([rdr data eof]
(if (not= eof data)
(cons data (lazy-seq (lazy-read rdr (read rdr false eof) eof))))))
(defn load-all-data [file]
(with-open [rdr (java.io.PushbackReader. (reader file))]
(doall (lazy-read rdr))))
(load-all-data "C:/Platform/Work/test.cdf")
Security
Also, it is good to mention security when loading code with read-string or read. You should only use them with trusted sources, because, using #= or a Java constructor, the source can execute arbitrary code inside your application. For a longer explanation, take a look at the documentation for read.
Setting *read-eval* to false would prevent the issue, but it would also prevent the reconstruction of the records in your sample. To avoid the issue all together, you can use the clojure.edn/read and clojure.edn/read-string functions, with a whitelist of readers.
(defn edn-read [eof rdr]
(clojure.edn/read {:eof eof :readers {'BTC.pcost.Role map->Role
'BTC.pcost.Member map->Member}}
rdr))
(defn lazy-edn-read
([rdr] (let [eof (Object.)] (lazy-edn-read rdr (edn-read eof rdr) eof)))
([rdr data eof]
(if (not= eof data)
(cons data (lazy-seq (lazy-edn-read rdr (edn-read eof rdr) eof))))))
(defn load-all-data [file]
(with-open [rdr (java.io.PushbackReader. (reader file))]
(doall (take-while (complement nil?) (lazy-edn-read rdr)))))
(load-all-data "C:/Platform/Work/test.cdf")
You can use read.
This function will read one object from a file:
(defn load-data-from-file [filename]
(with-open [rdr (java.io.PushbackReader. (reader filename))]
(read rdr)))
Or this will read all objects from the file:
(defn load-all-data-from-file [filename]
(let [eof (Object.)]
(with-open [rdr (java.io.PushbackReader. (reader filename))]
(doall
(take-while #(not= % eof)
(repeatedly #(read rdr nil eof)))))))
Here's the API documentation for read.
This is a small variation that will read all objects from a string:
(defn load-all-data-from-string [string]
(let [eof (Object.)]
(with-open [rdr (-> string java.io.StringReader. java.io.PushbackReader.)]
(doall
(take-while #(not= % eof)
(repeatedly #(read rdr nil eof)))))))
This is, as far as I know, not possible to do using read-string. Instead we use read with a java.io.StringReader.
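For example, reading three forms from an inline string:
(load-all-data-from-string "{:a 1} [2 3] :four")
;;=> ({:a 1} [2 3] :four)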
I'm trying to read a file that may or may not have YAML frontmatter, line by line, using Clojure, and return a hashmap with two vectors: one containing the frontmatter lines and one containing everything else (i.e., the body).
An example input file would look like this:
---
key1: value1
key2: value2
---
Body text paragraph 1
Body text paragraph 2
Body text paragraph 3
I have functioning code that does this, but to my (admittedly inexperienced with Clojure) nose, it reeks of code smell.
(defn process-file [f]
(with-open [rdr (java.io.BufferedReader. (java.io.FileReader. f))]
(loop [lines (line-seq rdr) in-fm 0 frontmatter [] body []]
(if-not (empty? lines)
(let [line (string/trim (first lines))]
(cond
(zero? (count line))
(recur (rest lines) in-fm frontmatter body)
(and (< in-fm 2) (= line "---"))
(recur (rest lines) (inc in-fm) frontmatter body)
(= in-fm 1)
(recur (rest lines) in-fm (conj frontmatter line) body)
:else
(recur (rest lines) in-fm frontmatter (conj body line))))
(hash-map :frontmatter frontmatter :body body)))))
Can someone point me to a more elegant way to do this? I'm going to be doing a decent amount of line-by-line parsing in this project, and I'd like a more idiomatic way of going about it if possible.
Firstly, I'd put line-processing logic in its own function to be called from a function actually reading in the files. Better yet, you can make the function dealing with IO take a function to map over the lines as an argument, perhaps along these lines:
(require '[clojure.java.io :as io])
(defn process-file-with [f filename]
(with-open [rdr (io/reader (io/file filename))]
(f (line-seq rdr))))
Note that this arrangement makes it the duty of f to realize as much of the line seq as it needs before it returns (because afterwards with-open will close the underlying reader of the line seq).
Given this division of responsibilities, the line processing function might look like this, assuming the first --- must be the first non-blank line and all blank lines are to be skipped (as they would be when using the code from the question text):
(require '[clojure.string :as string])
(defn process-lines [lines]
(let [ls (->> lines
(map string/trim)
(remove string/blank?))]
(if (= (first ls) "---")
(let [[front sep-and-body] (split-with #(not= "---" %) (next ls))]
{:front (vec front) :body (vec (next sep-and-body))})
{:body (vec ls)})))
Note the calls to vec which cause all the lines to be read in and returned in a vector or pair of vectors (so that we can use process-lines with process-file-with without the reader being closed too soon).
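Putting the two together is then a one-liner (the file name is just a placeholder):
(defn process-file [filename]
  (process-file-with process-lines filename))

;; (process-file "post.md")
Because process-lines realizes everything into vectors before returning, it is safe to use inside with-open here.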
Because reading lines from an actual file on disk is now decoupled from processing a seq of lines, we can easily test the latter part of the process at the REPL (and of course this can be made into a unit test):
;; could input this as a single string and split, of course
(def test-lines
["---"
"key1: value1"
"key2: value2"
"---"
""
"Body text paragraph 1"
""
"Body text paragraph 2"
""
"Body text paragraph 3"])
Calling our function now:
user> (process-lines test-lines)
{:front ["key1: value1" "key2: value2"],
 :body ["Body text paragraph 1"
        "Body text paragraph 2"
        "Body text paragraph 3"]}
actually, the idiomatic way to do it using clojure would be to avoid returning 'a hashmap with two vectors' and treat the file as a (lazy) sequence of lines
then, the function that will process the sequence of lines decides whether the file has a YAML frontmatter or not
something like this:
(use '[clojure.java.io :only (reader)])
(let [s (line-seq (reader "YOURFILENAMEHERE"))]
  (if (= "---" (first (line-seq (reader "YOURFILENAMEHERE"))))
    (process-seq-with-frontmatter s)
    (process-seq-without-frontmatter s)))
by the way, this is a quick and dirty solution; two things to improve:
notice I'm creating two seqs for the same file; it would be better to create just one and inspect its first line without traversing past the first element of the seq (like a peek instead of a pop)
I think it would be cleaner to have a multimethod 'process-seq' (with a better name of course) that would dispatch based on the content of the first line of the seq
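A sketch of what that multimethod could look like (the names and dispatch values are placeholders; blank-line handling is left out):
(require '[clojure.string :as str])

(defmulti process-seq
  (fn [lines] (if (= "---" (some-> (first lines) str/trim))
                :with-frontmatter
                :without-frontmatter)))

(defmethod process-seq :with-frontmatter [lines]
  (let [[front more] (split-with #(not= "---" (str/trim %)) (rest lines))]
    {:frontmatter (vec front) :body (vec (rest more))}))

(defmethod process-seq :without-frontmatter [lines]
  {:frontmatter [] :body (vec lines)})
Since the dispatch function only calls first on the seq, inspecting the first line really is the "peek" mentioned above: the same seq can still be passed whole to whichever method is chosen.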