clojure read large text file and count occurrences

clojure read large text file and count occurrences - clojure

I'm trying to read a large text file and count occurrences of specific errors.
For example, for the following sample text
something
bla
error123
foo
test
error123
line
junk
error55
more
stuff
I want to end up with (don't really care what data structure although I am thinking a map)
error123 - 2
error55 - 1
Here is what I have tried so far
(require '[clojure.java.io :as io])
(defn find-error [line]
(if (re-find #"error" line)
line))
(defn read-big-file [func, filename]
(with-open [rdr (io/reader filename)]
(doall (map func (line-seq rdr)))))
calling it like this
(read-big-file find-error "sample.txt")
returns:
(nil nil "error123" nil nil "error123" nil nil "error55" nil nil)
Next I tried to remove the nil values and group like items
(group-by identity (remove #(= nil %) (read-big-file find-error "sample.txt")))
which returns
{"error123" ["error123" "error123"], "error55" ["error55"]}
This is getting close to the desired output, although it may not be efficient. How can I get the counts now? Also,as someone new to clojure and functional programming I would appreciate any suggestions on how I might improve this.
thanks!

I think you might be looking for the frequencies function:
user=> (doc frequencies)
-------------------------
clojure.core/frequencies
([coll])
Returns a map from distinct items in coll to the number of times
they appear.
nil
So, this should give you what you want:
(frequencies (remove nil? (read-big-file find-error "sample.txt")))
;;=> {"error123" 2, "error55" 1}
If your text file is really large, however, I would recommend doing this on the line-seq inline to ensure you don't run out of memory. This way you can also use a filter rather than map and remove.
(defn count-lines [pred, filename]
(with-open [rdr (io/reader filename)]
(frequencies (filter pred (line-seq rdr)))))
(defn is-error-line? [line]
(re-find #"error" line))
(count-lines is-error-line? "sample.txt")
;; => {"error123" 2, "error55" 1}

Related

How to put the records of my file into a defined list and use it after in my code? - CLOJURE

(defn loadData [filename]
(with-open [rdr (io/reader filename)]
(doseq [line (line-seq rdr)
:let [ [id & data]
(clojure.string/split line #"\|")] ]
[
(Integer/parseInt id)
data
])
))
My file's content:
1|Xxx|info1|222-2222
2|Yyy|info2|333-3333
3|Zzz|info3|444-4444
How do I get this function to return a list like :
{
{1, Xxx, info1, 222-2222}
{2, Yyy, info2, 333-3333}
{3, Zzz, info3, 444-4444}
}
I also want to know how to get my function loadData to return this list above instead of NIL.
Please help! Thanks in advance...

(def my-file-content
(.getBytes "1|Xxx|info1|222-2222\n2|Yyy|info2|333-3333\n3|Zzz|info3|444-4444"))
;; See https://guide.clojure.style/ for common naming style.
(defn load-data
[filename]
;; Well done, `with-open` is correctly used here!
(with-open [rdr (io/reader filename)]
;; See https://clojuredocs.org/clojure.core/doseq for examples of `for` and `doseq`. After reading examples and documentation, it will probably become clear why `for` is what you want, and not `doseq`.
(for [line (line-seq rdr)
:let [[id & data] (clojure.string/split line #"\|")]]
[(Integer/parseInt id)
data])))
;; But `(load-data my-file-content)` raises an exception because `for` returns a lazy sequence. Nothing happens before you actually display the result, so the course of actions is:
;; - Open a reader;
;; - Define computations in a `LazySeq` as result of `for`;
;; - Close the reader;
;; - The function returns the lazy seq;
;; - Your REPL wants to display it, so it tries to evaluate the first elements;
;; - Now it actually tries to read from the reader, but it is closed.
;; This lazyness explains why this returns no exception (you don't look at what's inside the sequence):
(type (load-data my-file-content))
;; => clojure.lang.LazySeq
;; The shortest fix is:
(defn load-data
[filename]
(with-open [rdr (io/reader filename)]
;; https://clojuredocs.org/clojure.core/doall
(doall
(for [line (line-seq rdr)
:let [[id & data] (clojure.string/split line #"\|")]]
[(Integer/parseInt id)
data]))))
(load-data my-file-content)
;; => ([1 ("Xxx" "info1" "222-2222")] [2 ("Yyy" "info2" "333-3333")] [3 ("Zzz" "info3" "444-4444")])
;; This is not what you want. Here is the shortest fix to coerce this (realised) lazy sequence into what you want:
(defn load-data
[filename]
(with-open [rdr (io/reader filename)]
;; https://clojuredocs.org/clojure.core/doall
(doall
(for [line (line-seq rdr)
:let [[id & data] (clojure.string/split line #"\|")]]
(cons (Integer/parseInt id) data)))))
(load-data my-file-content)
;; => '((1 "Xxx" "info1" "222-2222") (2 "Yyy" "info2" "333-3333") (3 "Zzz" "info3" "444-4444"))
;; Also, note that a list is represented as `'(1 2 3)` or `[1 2 3]` for a vector. `{1 2 3}` is a syntax error, curly brackets are for maps.
(defn load-data
[filename]
(with-open [rdr (io/reader filename)]
(->> (line-seq rdr)
(mapv #(clojure.string/split % #"\|")))))
(load-data my-file-content)
;; => [["1" "Xxx" "info1" "222-2222"] ["2" "Yyy" "info2" "333-3333"] ["3" "Zzz" "info3" "444-4444"]]
;; Note that here we no longer have a list of lists, but a vector of vectors. In Clojure lists are lazy and vectors are not, so this is why `mapv`, which returns a vector, works fine with an open reader.

Using proper functional style in a file processing task

I have an input csv file and need to generate an output file that has one line for each input line. Each input line could be of a specific type (say "old" or "new") that can be determined only by processing the input line.
In addition to generating the output file, we also want to print the summary of how many lines of each type were in the input file. My actual task involves generating different SQLs based on the input line type, but to keep the example code focussed, I have kept the processing in the function proc-line simple. The function func determines what type an input line is -- again, I have kept it simple by randomly generating a type. The actual logic is more involved.
I have the following code and it does the job. However, to retain a functional style for the task of generating the summary, I chose to return a keyword to signify the type of each line and created a lazy sequence of these for generating the final summary. In an imperative style, we would simply increment a count for each line type. Generating a potentially large collection just for summarizing seems inefficient. Another consequence of the way I have coded it is the repetition of the (.write writer ...) portion. Ideally, I would code that just once.
Any suggestions for eliminating the two problems I have identified (and others)?
(ns file-proc.core
(:gen-class)
(:require [clojure.data.csv :as csv]
[clojure.java.io :as io]))
(defn func [x]
(rand-nth [true false]))
(defn proc-line [line writer]
(if (func line)
(do (.write writer (str line "\n")) :new)
(do (.write writer (str (reverse line) "\n")) :old)))
(defn generate-report [from to]
(with-open
[reader (io/reader from)
writer (io/writer to)]
(->> (csv/read-csv reader)
(rest)
(map #(proc-line % writer))
(frequencies)
(doall))))

I'd try to separate data processing from side-effects like reading/writing files. Hopefully this would allow the IO operations to stay at opposite boundaries of the pipeline, and the "middle" processing logic is agnostic of where the input comes from and where the output is going.
(defn rand-bool [] (rand-nth [true false]))
(defn proc-line [line]
(if (rand-bool)
[line :new]
[(reverse line) :old]))
proc-line no longer takes a writer, it only cares about the line and it returns a vector/2-tuple of the processed line along with a keyword. It doesn't concern itself with string formatting either—we should let csv/write-csv do that. Now you could do something like this:
(defn process-lines [reader]
(->> (csv/read-csv reader)
(rest)
(map proc-line)))
(defn generate-report [from to]
(with-open [reader (io/reader from)
writer (io/writer to)]
(let [lines (process-lines reader)]
(csv/write-csv writer (map first lines))
(frequencies (map second lines)))))
This will work but it's going to realize/keep the entire input sequence in memory, which you don't want for large files. We need a way to keep this pipeline lazy/efficient, but we also have to produce two "streams" from one and in a single pass: the processed lines only to be sent to write-csv, and each line's metadata for calculating frequencies. One "easy" way to do this is to introduce some mutability to track the metadata frequencies as the lazy sequence is consumed by write-csv:
(defn generate-report [from to]
(with-open [reader (io/reader from)
writer (io/writer to)]
(let [freqs (atom {})]
(->> (csv/read-csv reader)
;; processing starts
(rest)
(map (fn [line]
(let [[row tag] (proc-line line)]
(swap! freqs update tag (fnil inc 0))
row)))
;; processing ends
(csv/write-csv writer))
#freqs)))
I removed the process-lines call to make the full pipeline more apparent. By the time write-csv has fully (and lazily) consumed its payload, freqs will be a map like {:old 23, :new 31} which will be the return value of generate-report. There's room for improvement/generalization, but I think this is a start.

As others have mentioned, separating writing and processing work would be ideal. Here's how I usually do this:
(defn product-type [p]
(rand-nth [:new :old]))
(defn row->product [row]
(let [p (zipmap [:id :name :price] row)]
(assoc p :type (product-type p))))
(defmulti to-csv :type)
(defmethod to-csv :new [product] ...)
(defmethod to-csv :old [product] ...)
(defn generate-report [from to]
(with-open [rdr (io/reader from)
wrtr (io/writer to)]
(->> (rest (csv/read-csv rdr))
(map row->product)
(map #(do (.write wrtr (to-csv %)) %))
(map :type)
(frequencies)
(doall))))
(The code might not work—didn't run it, sorry.)
Constructing a hash-map and using multimethods is optional, of course, but it's better to assign a product its type first. This way its data dictates what pipeline is doing, not proc-line.

To refactor the code we need the safety net of at least one characterization test for generate-report. Since that function does file I/O (we will make the code independent from I/O later), we will use this sample CSV file, f1.csv:
Year,Code
1997,A
2000,B
2010,C
1996,D
2001,E
We cannot yet write a test because function func uses a RNG, so we rewrite it to be deterministic by actually looking at the input. While there, we rename it to new?, which is more representative of the problem:
(defn new? [row]
(>= (Integer/parseInt (first row)) 2000))
where, for the sake of the exercise, we assume that a row is "new" if the Year column is >= 2000.
We can now write the test and see it pass (here for brevity we focus only on the frequency calculation, not on the output transformation):
(deftest characterization-as-posted
(is (= {:old 2, :new 3}
(generate-report "f1.csv" "f1.tmp"))))
And now to the refactoring. The main idea is to realize that we need an accumulator, replacing map with reduce and getting rid of frequencies and of doall. Also, we rename "line" with "row", since this is how a line is called in the CSV format:
(defn generate-report [from to] ; 1
(let [[old new _] ; 2
(with-open [reader (io/reader from) ; 3
writer (io/writer to)] ; 4
(->> (csv/read-csv reader) ; 5
(rest) ; 6
(reduce process-row [0 0 writer])))] ; 7
{:old old :new new})) ; 8
The new process-row (originally process-line) becomes:
(defn process-row [[old new writer] row]
(if (new? row)
(do (.write writer (str row "\n")) [old (inc new) writer])
(do (.write writer (str (reverse row) "\n")) [(inc old) new writer])))
Function process-row, as any function to be passed to reduce, has two arguments: first argument [old new writer] is a vector of two accumulators and of the I/O writer (the vector is destructured); second argument row is one element of the collection that is being reduced. It returns the new vector of accumulators, that at the end of the collection is destructured in line 2 of generate-report and used at line 8 to create a hashmap equivalent to the one previously returned by frequencies.
We can do one last refactoring: separate the file I/O from the business logic, so that we can write tests without the scaffolding of preparated input files, as follows.
Function process-row becomes:
(defn process-row [[old-cnt new-cnt writer] row]
(let [[out-row old new] (process-row-pure old-cnt new-cnt row)]
(do (.write writer out-row)
[old new writer])))
and the business logic can be done by the pure (and so easily testable) function:
(defn process-row-pure [old new row]
(if (new? row)
[(str row "\n") old (inc new)]
[(str (reverse row) "\n") (inc old) new]))
All this without mutating anything.

IMHO, I would separate the two different aspects: counting the frequencies and writing to a file:
(defn count-lines
([lines] (count-lines lines 0 0))
([lines count-old count-new]
(if-let [line (first lines)]
(if (func line)
(recur count-old (inc count-new) (rest lines))
(recur (inc count-old) count-new (rest lines)))
{:new count-new :old count-old})))
(defn generate-report [from to]
(with-open [reader (io/reader from)
writer (io/writer to)]
(let [lines (rest (csv/read-csv reader))
frequencies (count-lines lines)]
(doseq [line lines]
(.write writer (str line "\n"))))))

Clojure lazy-eval inside apply

I'm trying to read in a file line by line and concatenate a new string to the end of each line. For testing I've done this:
(defn read-file
[filename]
(with-open [rdr (clojure.java.io/reader filename)]
(doall (line-seq rdr))))
(apply str ["asdfasdf" (doall (take 1 (read-file filename)))])
If I just evaluate (take 1 (read-file filename)) in a repl, I get the first line of the file. However, when I try to evaluate what I did above, I get "asdfasdfclojure.lang.LazySeq#4be5d1db".
Can anyone explain how to forcefully evaluate take to get it to not return the lazy sequence?

The take function is lazy by design, so you may have to realize the values you want, using first, next, or nth, or operate on the entire seq with functions like apply, reduce, vec, or into.
In your case, it looks like you are trying to do the following:
(apply str ["asdfasdf" (apply str (take 1 (read-file filename)))])
Or:
(str "asdfasdf" (first (read-file filename)))
You can also realize the entire lazyseq using doall. Just keep in mind, a realized lazy seq is still a seq.
(realized? (take 1 (read-file filename))) ;; => false
(type (take 1 (read-file filename))) ;; => clojure.lang.LazySeq
(realized? (doall (take 1 (read-file filename)))) ;; => true
(type (doall (take 1 (read-file filename)))) ;; => clojure.lang.LazySeq
A better option would be to apply your transformations lazily, using something like map, and select the values you want from the resulting seq. (Like stream processing.)
(first (map #(str "prefix" % "suffix")
(read-file filename)))
Note: map is lazy, so it will return an unrealized LazySeq.

More idiomatic line-by-line handling of a file in Clojure

I'm trying to read a file that (may or may not) have YAML frontmatter line-by-line using Clojure, and return a hashmap with two vectors, one containing the frontmatter lines and one containing everything else (i.e., the body).
And example input file would look like this:
---
key1: value1
key2: value2
---
Body text paragraph 1
Body text paragraph 2
Body text paragraph 3
I have functioning code that does this, but to my (admittedly inexperienced with Clojure) nose, it reeks of code smell.
(defn process-file [f]
(with-open [rdr (java.io.BufferedReader. (java.io.FileReader. f))]
(loop [lines (line-seq rdr) in-fm 0 frontmatter [] body []]
(if-not (empty? lines)
(let [line (string/trim (first lines))]
(cond
(zero? (count line))
(recur (rest lines) in-fm frontmatter body)
(and (< in-fm 2) (= line "---"))
(recur (rest lines) (inc in-fm) frontmatter body)
(= in-fm 1)
(recur (rest lines) in-fm (conj frontmatter line) body)
:else
(recur (rest lines) in-fm frontmatter (conj body line))))
(hash-map :frontmatter frontmatter :body body)))))
Can someone point me to a more elegant way to do this? I'm going to be doing a decent amount of line-by-line parsing in this project, and I'd like a more idiomatic way of going about it if possible.

Firstly, I'd put line-processing logic in its own function to be called from a function actually reading in the files. Better yet, you can make the function dealing with IO take a function to map over the lines as an argument, perhaps along these lines:
(require '[clojure.java.io :as io])
(defn process-file-with [f filename]
(with-open [rdr (io/reader (io/file filename))]
(f (line-seq rdr))))
Note that this arrangement makes it the duty of f to realize as much of the line seq as it needs before it returns (because afterwards with-open will close the underlying reader of the line seq).
Given this division of responsibilities, the line processing function might look like this, assuming the first --- must be the first non-blank line and all blank lines are to be skipped (as they would be when using the code from the question text):
(require '[clojure.string :as string])
(defn process-lines [lines]
(let [ls (->> lines
(map string/trim)
(remove string/blank?))]
(if (= (first ls) "---")
(let [[front sep-and-body] (split-with #(not= "---" %) (next ls))]
{:front (vec front) :body (vec (next sep-and-body))})
{:body (vec ls)})))
Note the calls to vec which cause all the lines to be read in and returned in a vector or pair of vectors (so that we can use process-lines with process-file-with without the reader being closed too soon).
Because reading lines from an actual file on disk is now decoupled from processing a seq of lines, we can easily test the latter part of the process at the REPL (and of course this can be made into a unit test):
;; could input this as a single string and split, of course
(def test-lines
["---"
"key1: value1"
"key2: value2"
"---"
""
"Body text paragraph 1"
""
"Body text paragraph 2"
""
"Body text paragraph 3"])
Calling our function now:
user> (process-lines test-lines)
{:front ("key1: value1" "key2: value2"),
:body ("Body text paragraph 1"
"Body text paragraph 2"
"Body text paragraph 3")}

actually, the idiomatic way to do it using clojure would be to avoid returning 'a hashmap with two vectors' and treat the file as a (lazy) sequence of lines
then, the function that will process the sequence of lines decides whether the file has a YAML frontmatter or not
something like this:
(use '[clojure.java.io :only (reader)])
(let [s (line-seq (reader "YOURFILENAMEHERE"))]
(if (= "---\n" (take 1 (line-seq (reader "YOURFILENAMEHERE"))))
(process-seq-with-frontmatter s)
(process-seq-without-frontmatter s))
by the way, this is a quit and dirty solution; two things to improve:
notice I'm creating two seqs for the same file, it would be better to create just one and make the inspection of the first line so that it wouldn't traverse over the first element of the seq (like a peek instead of a pop)
I think it would be cleaner to have a multimethod 'process-seq' (with a better name of course) that would dispatch based on the content of the first line of the seq

How to use clojure.edn/read to get a sequence of objects in a file?

Clojure 1.5 introduced clojure.edn, which includes a read function that requires a PushbackReader.
If I want to read the first five objects, I can do:
(with-open [infile (java.io.PushbackReader. (clojure.java.io/reader "foo.txt"))]
(binding [*in* infile]
(let [edn-seq (repeatedly clojure.edn/read)]
(dorun (take 5 (map println edn-seq))))))
How can I instead print out all of the objects? Considering that some of them may be nils, it seems like I need to check for the EOF, or something similar. I want to have a sequence of objects similar to what I would get from line-seq.

Use :eof key
http://clojure.github.com/clojure/clojure.edn-api.html
opts is a map that can include the following keys: :eof - value to
return on end-of-file. When not supplied, eof throws an exception.
edit: sorry, that wasn't enough detail! here y'go:
(with-open [in (java.io.PushbackReader. (clojure.java.io/reader "foo.txt"))]
(let [edn-seq (repeatedly (partial edn/read {:eof :theend} in))]
(dorun (map println (take-while (partial not= :theend) edn-seq)))))
that should do it

I looked at this again. Here is what I came up with:
(defn edn-seq
"Returns the objects from stream as a lazy sequence."
([]
(edn-seq *in*))
([stream]
(edn-seq {} stream))
([opts stream]
(lazy-seq (cons (clojure.edn/read opts stream) (edn-seq opts stream)))))
(defn swallow-eof
"Ignore an EOF exception raised when consuming seq."
[seq]
(-> (try
(cons (first seq) (swallow-eof (rest seq)))
(catch java.lang.RuntimeException e
(when-not (= (.getMessage e) "EOF while reading")
(throw e))))
lazy-seq))
(with-open [stream (java.io.PushbackReader. (clojure.java.io/reader "foo.txt"))]
(dorun (map println (swallow-eof (edn-seq stream)))))
edn-seq has the same signature as clojure.edn/read, and preserves all of the existing behavior, which I think is important given that people may use the :eof option in different ways. A separate function to contain the EOF exception seemed like a better choice, though I'm not sure how best to capture it since it shows up just as a java.lang.RuntimeException.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

clojure read large text file and count occurrences - clojure

Related

How to put the records of my file into a defined list and use it after in my code? - CLOJURE

Using proper functional style in a file processing task

Clojure lazy-eval inside apply

More idiomatic line-by-line handling of a file in Clojure

How to use clojure.edn/read to get a sequence of objects in a file?

Categories

Resources