Create a mini database in Clojure from a txt file - clojure

I am really new to Clojure and I'm having trouble with the following:
I am trying to read data from a txt file that has lines in this format:
1|John Smith|123 Here Street|456-4567
2|Sue Jones|43 Rose Court Street|345-7867
3|Fan Yuhong|165 Happy Lane|345-4533
I figured out how to read and split the data with the following code
(with-open [reader (clojure.java.io/reader "cust.txt")]
  (vec (for [line (line-seq reader)]                 ; iterate over each line
         (->> (clojure.string/split line #"\|")      ; split it by "|"
              (remove empty?)
              (zipmap [:custID :name :address :phoneNumber]))))) ; turn into a map
The issue is with the result of this zipmap: I am having trouble accessing individual values like custID.
(get zipmap :custID)
The code above returns nil
How should I change my code so that I can access a customer's data?

First, don't name the result zipmap, since that shadows the core function of the same name. Let's call it ret.
ret is a vector of maps. You can't (get ret :custID), which is why you are getting nil. If you want all the custIDs, use (map :custID ret). Or maybe you want to index your vector by custID, in which case create (def ret2 (zipmap (map :custID ret) ret)) and then you can easily find the second entry with (get ret2 "2") (note that the IDs are strings here, since they were never parsed into numbers).
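For illustration, a minimal end-to-end sketch of that approach (cust.txt and the keys come from the question; the string "2" is used as the lookup key because the IDs are not parsed into numbers):
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; read the file into a vector of customer maps, as in the question
(def ret
  (with-open [reader (io/reader "cust.txt")]
    (vec (for [line (line-seq reader)]
           (zipmap [:custID :name :address :phoneNumber]
                   (remove empty? (str/split line #"\|")))))))

;; index the customers by :custID for direct lookup
(def ret2 (zipmap (map :custID ret) ret))

(get ret2 "2")          ; => the full map for Sue Jones
(:name (get ret2 "2"))  ; => "Sue Jones"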

Related

Why do I get a RuntimeException when I try to fill file lines into a vector in Clojure?

I am very new to the language and I am trying to get used to Clojure.
I want to read a file which contains Strings like:
prefix/FirstEntry
prefix/SecondEntry
prefix/ThirdEntry
I want to fill a vector with one line per element. I also need to get rid of the prefix, so I read the file and replace prefix/ with an empty string "".
(defn save-clean-lines [the-file] [the-prefix]
  (def vc-file (read-and-cut-file the-file the-prefix)))

(defn read-and-cut-file
  [file, pref]
  (with-open [rdr (clojure.java.io/reader file)]
    (reduce conj [] (line-seq (rdr/replace pref ""))))) ;; return a vector with deleted prefix
When I test it I get the exception:
java.lang.RuntimeException: Unable to resolve symbol: read-and-cut-file in this context on (def vc-file (read-and-cut-file the-file the-prefix))
Why is that the case?
(->> "data.txt"
(slurp)
(str/split-lines)
(into [] (map (fn [s] (str/replace s #"^prefix/" "")))))
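If you do want to keep the two definitions in their original order, a forward declaration also resolves the symbol (a minimal sketch):
;; declare the symbol up front so save-clean-lines can refer to
;; read-and-cut-file before its definition appears later in the file
(declare read-and-cut-file)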

streaming multiple output lines from a single input line using Clojure data.csv

I am reading a csv file, processing the input, appending the output to the input, and writing the results to an output csv. Seems pretty straightforward. I am using Clojure data.csv. However, I ran into a nuance in the output that does not fit with anything I've run into before with Clojure, and I cannot figure it out. The output will contain 0 to N lines for each input, and I cannot figure out how to stream this down to the calling fn.
Here is the form that is processing the file:
(defn process-file
  [from to]
  (let [ctr (atom 0)]
    (with-open [r (io/reader from)
                w (io/writer to)]
      (some->> (csv/read-csv r)
               (map #(process-line % ctr))
               (csv/write-csv w)))))
And here is the form that processes each line (that returns 0 to N lines that each need to be written to the output csv):
(defn process-line
  [line ctr]
  (swap! ctr inc)
  (->> (apps-for-org (first line))
       (reduce #(conj %1 (add-results-to-input line %2)) [])))
Honestly, I didn't fully understand your question, but from your comment, I seem to have answered it.
If you want to run csv/write-csv for each row returned, you could just map over the rows:
(some->> (csv/read-csv r)
         (map #(process-line % ctr))
         (mapv #(csv/write-csv w %)))
Note my use of mapv since you're running side effects. Depending on the context, if you used just map, the laziness may prevent the writes from happening.
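For instance (purely illustrative; w and rows stand in for the writer and the processed rows from above), nothing is written until something forces the lazy sequence:
(def pending (map #(csv/write-csv w %) rows)) ; lazy, no writes yet
(dorun pending)                               ; realizes the seq, the writes happen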
It would be arguably more correct to use doseq however:
(let [rows (some->> (csv/read-csv r)
                    (map #(process-line % ctr)))]
  (doseq [row rows]
    (csv/write-csv w row)))
doseq makes it clear that the intent of the iteration is to carry out side effects and not to produce a new (immutable) list.
I think the problem you're running into is that your map function process-line returns a collection of zero-to-many rows, so when you map over the input lines you get a collection of collections, whereas you just want one collection (of rows) to send to write-csv. If that's true, the fix is simply changing this line to use mapcat:
(mapcat #(process-line % ctr))
mapcat is like map, but it will "merge" the resulting collections of each process-line call into a single collection.
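For illustration, the whole process-file from the question would then look roughly like this (a sketch only; io, csv, apps-for-org, and add-results-to-input are the question's own names):
(defn process-file
  [from to]
  (let [ctr (atom 0)]
    (with-open [r (io/reader from)
                w (io/writer to)]
      (some->> (csv/read-csv r)
               (mapcat #(process-line % ctr)) ; flatten the 0..N output rows per input row
               (csv/write-csv w)))))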

Using proper functional style in a file processing task

I have an input csv file and need to generate an output file that has one line for each input line. Each input line could be of a specific type (say "old" or "new") that can be determined only by processing the input line.
In addition to generating the output file, we also want to print the summary of how many lines of each type were in the input file. My actual task involves generating different SQLs based on the input line type, but to keep the example code focussed, I have kept the processing in the function proc-line simple. The function func determines what type an input line is -- again, I have kept it simple by randomly generating a type. The actual logic is more involved.
I have the following code and it does the job. However, to retain a functional style for the task of generating the summary, I chose to return a keyword to signify the type of each line and created a lazy sequence of these for generating the final summary. In an imperative style, we would simply increment a count for each line type. Generating a potentially large collection just for summarizing seems inefficient. Another consequence of the way I have coded it is the repetition of the (.write writer ...) portion. Ideally, I would code that just once.
Any suggestions for eliminating the two problems I have identified (and others)?
(ns file-proc.core
  (:gen-class)
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]))

(defn func [x]
  (rand-nth [true false]))

(defn proc-line [line writer]
  (if (func line)
    (do (.write writer (str line "\n")) :new)
    (do (.write writer (str (reverse line) "\n")) :old)))

(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (->> (csv/read-csv reader)
         (rest)
         (map #(proc-line % writer))
         (frequencies)
         (doall))))
I'd try to separate data processing from side effects like reading/writing files. Hopefully this allows the IO operations to stay at opposite boundaries of the pipeline, with the "middle" processing logic agnostic of where the input comes from and where the output is going.
(defn rand-bool [] (rand-nth [true false]))

(defn proc-line [line]
  (if (rand-bool)
    [line :new]
    [(reverse line) :old]))
proc-line no longer takes a writer, it only cares about the line and it returns a vector/2-tuple of the processed line along with a keyword. It doesn't concern itself with string formatting either—we should let csv/write-csv do that. Now you could do something like this:
(defn process-lines [reader]
  (->> (csv/read-csv reader)
       (rest)
       (map proc-line)))

(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (let [lines (process-lines reader)]
      (csv/write-csv writer (map first lines))
      (frequencies (map second lines)))))
This will work but it's going to realize/keep the entire input sequence in memory, which you don't want for large files. We need a way to keep this pipeline lazy/efficient, but we also have to produce two "streams" from one and in a single pass: the processed lines only to be sent to write-csv, and each line's metadata for calculating frequencies. One "easy" way to do this is to introduce some mutability to track the metadata frequencies as the lazy sequence is consumed by write-csv:
(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (let [freqs (atom {})]
      (->> (csv/read-csv reader)
           ;; processing starts
           (rest)
           (map (fn [line]
                  (let [[row tag] (proc-line line)]
                    (swap! freqs update tag (fnil inc 0))
                    row)))
           ;; processing ends
           (csv/write-csv writer))
      @freqs)))
I removed the process-lines call to make the full pipeline more apparent. By the time write-csv has fully (and lazily) consumed its payload, freqs will be a map like {:old 23, :new 31} which will be the return value of generate-report. There's room for improvement/generalization, but I think this is a start.
As others have mentioned, separating writing and processing work would be ideal. Here's how I usually do this:
(defn product-type [p]
  (rand-nth [:new :old]))

(defn row->product [row]
  (let [p (zipmap [:id :name :price] row)]
    (assoc p :type (product-type p))))

(defmulti to-csv :type)
(defmethod to-csv :new [product] ...)
(defmethod to-csv :old [product] ...)

(defn generate-report [from to]
  (with-open [rdr (io/reader from)
              wrtr (io/writer to)]
    (->> (rest (csv/read-csv rdr))
         (map row->product)
         (map #(do (.write wrtr (to-csv %)) %))
         (map :type)
         (frequencies)
         (doall))))
(The code might not work—didn't run it, sorry.)
Constructing a hash-map and using multimethods is optional, of course, but it's better to assign a product its type first. That way the data dictates what the pipeline does, rather than proc-line.
To refactor the code we need the safety net of at least one characterization test for generate-report. Since that function does file I/O (we will make the code independent from I/O later), we will use this sample CSV file, f1.csv:
Year,Code
1997,A
2000,B
2010,C
1996,D
2001,E
We cannot yet write a test because function func uses a RNG, so we rewrite it to be deterministic by actually looking at the input. While there, we rename it to new?, which is more representative of the problem:
(defn new? [row]
  (>= (Integer/parseInt (first row)) 2000))
where, for the sake of the exercise, we assume that a row is "new" if the Year column is >= 2000.
We can now write the test and see it pass (here for brevity we focus only on the frequency calculation, not on the output transformation):
(deftest characterization-as-posted
  (is (= {:old 2, :new 3}
         (generate-report "f1.csv" "f1.tmp"))))
And now to the refactoring. The main idea is to realize that we need an accumulator, replacing map with reduce and getting rid of frequencies and of doall. Also, we rename "line" to "row", since that is what a line is called in the CSV format:
(defn generate-report [from to]                     ; 1
  (let [[old new _]                                 ; 2
        (with-open [reader (io/reader from)         ; 3
                    writer (io/writer to)]          ; 4
          (->> (csv/read-csv reader)                ; 5
               (rest)                               ; 6
               (reduce process-row [0 0 writer])))] ; 7
    {:old old :new new}))                           ; 8
The new process-row (originally process-line) becomes:
(defn process-row [[old new writer] row]
  (if (new? row)
    (do (.write writer (str row "\n")) [old (inc new) writer])
    (do (.write writer (str (reverse row) "\n")) [(inc old) new writer])))
Function process-row, like any function passed to reduce, takes two arguments: the first argument [old new writer] is a vector of the two accumulators plus the I/O writer (the vector is destructured); the second argument row is one element of the collection being reduced. It returns the new vector of accumulators, which, at the end of the collection, is destructured at line 2 of generate-report and used at line 8 to build a hash map equivalent to the one previously returned by frequencies.
We can do one last refactoring: separate the file I/O from the business logic, so that we can write tests without the scaffolding of prepared input files, as follows.
Function process-row becomes:
(defn process-row [[old-cnt new-cnt writer] row]
  (let [[out-row old new] (process-row-pure old-cnt new-cnt row)]
    (.write writer out-row)
    [old new writer]))
and the business logic can be done by the pure (and so easily testable) function:
(defn process-row-pure [old new row]
  (if (new? row)
    [(str row "\n") old (inc new)]
    [(str (reverse row) "\n") (inc old) new]))
All this without mutating anything.
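Since process-row-pure is pure, it can also be exercised without any files at all; a minimal sketch of such a test, assuming the new? predicate above and clojure.test (the row data is illustrative):
(require '[clojure.test :refer [deftest is]])

(deftest process-row-pure-counts
  ;; a "new" row (Year >= 2000) increments only the new counter
  (is (= [0 1] (rest (process-row-pure 0 0 ["2001" "E"]))))
  ;; an "old" row increments only the old counter
  (is (= [1 0] (rest (process-row-pure 0 0 ["1996" "D"])))))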
IMHO, I would separate the two different aspects: counting the frequencies and writing to a file:
(defn count-lines
  ([lines] (count-lines lines 0 0))
  ([lines count-old count-new]
   (if-let [line (first lines)]
     (if (func line)
       (recur (rest lines) count-old (inc count-new))
       (recur (rest lines) (inc count-old) count-new))
     {:new count-new :old count-old})))
(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (let [lines (rest (csv/read-csv reader))
          freqs (count-lines lines)]
      (doseq [line lines]
        (.write writer (str line "\n")))
      freqs)))

Filter (dedupe) and concat csv files

A tip request please:
How can I concat a set of large csv files into one? I need rows identified as duplicates removed (i.e. filtered with something like (some #{s} (get row 1))). Each file has no duplicates in itself; duplicate rows only appear between the files. The order of the final output isn't crucial, but matching a sequential scan of the files would be preferred.
The total number of ids to maintain is about 150,000,000, so maintaining a set that large in memory is doable, I think.
So, I've got a fn that takes a filename and a set of ids to avoid and returns a filtered sequence of rows. I've also got a vector of filenames to process. I can't wrap my head around how to output the filtered rows to a single file while conj-ing the ids from each filtered set of rows into an existing set.
(defn open-seq
  "read file f and filter rows based on set s"
  [f s]
  (letfn [(iset? [x]
            (let [ls (s/split x #", ")
                  id (read-string (get ls 1))]
              (not (some #{id} s))))]
    (with-open [in (io/reader f)]
      (->> (line-seq in)
           (filter iset?)
           ; shortcut (take 20)
           doall))))
EDIT:
This is a two-pass solution.
(defn proc [infiles outfile]
  (with-open [outf (io/writer outfile)]
    (let [s (atom #{})]
      (doseq [infile infiles]
        (with-open [in (io/reader infile)]
          (doseq [line (open-seq in @s)]
            (.write outf line)
            (.newLine outf)))
        (with-open [in (io/reader infile)]
          (let [ids (->> (open-seq in @s)
                         (map (fn [x] (get x 1))))]
            (swap! s conj ids)))))))
I suppose I could conj each id onto the set atom with each line. I guess I had a preconceived notion that conj-ing the whole seq of ids would be more idiomatic.
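For what it's worth, a single-pass version along those lines might look roughly like this (a sketch only, assuming [clojure.java.io :as io] and [clojure.string :as s], and that the id lives in the second comma-separated column as in open-seq above; ids are kept as strings here):
(defn dedupe-concat [infiles outfile]
  (with-open [out (io/writer outfile)]
    (reduce
      (fn [seen infile]
        (with-open [in (io/reader infile)]
          (reduce
            (fn [seen line]
              (let [id (get (s/split line #", ") 1)]
                (if (contains? seen id)
                  seen                    ; duplicate id: skip the row
                  (do (.write out line)   ; new id: write the row and remember it
                      (.newLine out)
                      (conj seen id)))))
            seen
            (line-seq in))))
      #{}
      infiles)))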

clojure read large text file and count occurrences

I'm trying to read a large text file and count occurrences of specific errors.
For example, for the following sample text
something
bla
error123
foo
test
error123
line
junk
error55
more
stuff
I want to end up with (don't really care what data structure although I am thinking a map)
error123 - 2
error55 - 1
Here is what I have tried so far
(require '[clojure.java.io :as io])

(defn find-error [line]
  (if (re-find #"error" line)
    line))

(defn read-big-file [func, filename]
  (with-open [rdr (io/reader filename)]
    (doall (map func (line-seq rdr)))))
calling it like this
(read-big-file find-error "sample.txt")
returns:
(nil nil "error123" nil nil "error123" nil nil "error55" nil nil)
Next I tried to remove the nil values and group like items
(group-by identity (remove #(= nil %) (read-big-file find-error "sample.txt")))
which returns
{"error123" ["error123" "error123"], "error55" ["error55"]}
This is getting close to the desired output, although it may not be efficient. How can I get the counts now? Also, as someone new to Clojure and functional programming, I would appreciate any suggestions on how I might improve this.
thanks!
I think you might be looking for the frequencies function:
user=> (doc frequencies)
-------------------------
clojure.core/frequencies
([coll])
Returns a map from distinct items in coll to the number of times
they appear.
nil
So, this should give you what you want:
(frequencies (remove nil? (read-big-file find-error "sample.txt")))
;;=> {"error123" 2, "error55" 1}
If your text file is really large, however, I would recommend doing this on the line-seq inline to ensure you don't run out of memory. This way you can also use a filter rather than map and remove.
(defn count-lines [pred, filename]
  (with-open [rdr (io/reader filename)]
    (frequencies (filter pred (line-seq rdr)))))

(defn is-error-line? [line]
  (re-find #"error" line))
(count-lines is-error-line? "sample.txt")
;; => {"error123" 2, "error55" 1}