Clojure lazy-eval inside apply

I'm trying to read in a file line by line and concatenate a new string to the end of each line. For testing I've done this:
(defn read-file
[filename]
(with-open [rdr (clojure.java.io/reader filename)]
(doall (line-seq rdr))))
(apply str ["asdfasdf" (doall (take 1 (read-file filename)))])
If I just evaluate (take 1 (read-file filename)) in a REPL, I get the first line of the file. However, when I try to evaluate what I did above, I get "asdfasdfclojure.lang.LazySeq@4be5d1db".
Can anyone explain how to forcefully evaluate take to get it to not return the lazy sequence?

The take function is lazy by design, so you may have to realize the values you want, using first, next, or nth, or operate on the entire seq with functions like apply, reduce, vec, or into.
In your case, it looks like you are trying to do the following:
(apply str ["asdfasdf" (apply str (take 1 (read-file filename)))])
Or:
(str "asdfasdf" (first (read-file filename)))
You can also realize the entire lazyseq using doall. Just keep in mind, a realized lazy seq is still a seq.
(realized? (take 1 (read-file filename))) ;; => false
(type (take 1 (read-file filename))) ;; => clojure.lang.LazySeq
(realized? (doall (take 1 (read-file filename)))) ;; => true
(type (doall (take 1 (read-file filename)))) ;; => clojure.lang.LazySeq
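To make the point above concrete, here is a small sketch that uses an in-memory vector in place of the file (the `lines` data and the var names are made up for illustration). It shows that `str` on a LazySeq prints the class name whether or not the seq has been realized, and that extracting the element or splicing the seq with `apply` gives the intended string.

```clojure
;; Hypothetical stand-in for the lines of a file.
(def lines ["first line" "second line"])

;; str calls .toString on the LazySeq, which yields the class name and a hash.
(def wrong (str "asdfasdf" (take 1 lines)))

;; doall realizes the seq but returns the same LazySeq, so nothing changes here.
(def still-wrong (str "asdfasdf" (doall (take 1 lines))))

;; Either pull the element out, or splice the seq into str's arguments.
(def with-first (str "asdfasdf" (first lines)))
(def with-apply (apply str "asdfasdf" (take 1 lines)))

(println with-first) ; asdfasdffirst line
(println with-apply) ; asdfasdffirst line
```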
A better option would be to apply your transformations lazily, using something like map, and select the values you want from the resulting seq. (Like stream processing.)
(first (map #(str "prefix" % "suffix")
(read-file filename)))
Note: map is lazy, so it will return an unrealized LazySeq.
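For the original goal of appending a string to every line, one possible sketch is shown below. It reads from a StringReader so that it runs without a file; `suffix-lines` and the sample text are assumptions made for the example. Because the reader is closed when with-open exits, this variant uses the eager mapv so that every line is read before the close.

```clojure
(import '(java.io BufferedReader StringReader))

(defn suffix-lines
  "Returns a vector with suffix appended to every line read from rdr."
  [^BufferedReader rdr suffix]
  ;; mapv is eager: all lines are consumed while the reader is still open.
  (mapv #(str % suffix) (line-seq rdr)))

(def result
  (with-open [rdr (BufferedReader. (StringReader. "alpha\nbeta\n"))]
    (suffix-lines rdr "-asdfasdf")))

(println result) ; [alpha-asdfasdf beta-asdfasdf]
```

With a real file, the reader would presumably come from clojure.java.io/reader as in read-file above.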

Related

Using proper functional style in a file processing task

I have an input csv file and need to generate an output file that has one line for each input line. Each input line could be of a specific type (say "old" or "new") that can be determined only by processing the input line.
In addition to generating the output file, we also want to print the summary of how many lines of each type were in the input file. My actual task involves generating different SQLs based on the input line type, but to keep the example code focussed, I have kept the processing in the function proc-line simple. The function func determines what type an input line is -- again, I have kept it simple by randomly generating a type. The actual logic is more involved.
I have the following code and it does the job. However, to retain a functional style for the task of generating the summary, I chose to return a keyword to signify the type of each line and created a lazy sequence of these for generating the final summary. In an imperative style, we would simply increment a count for each line type. Generating a potentially large collection just for summarizing seems inefficient. Another consequence of the way I have coded it is the repetition of the (.write writer ...) portion. Ideally, I would code that just once.
Any suggestions for eliminating the two problems I have identified (and others)?
(ns file-proc.core
(:gen-class)
(:require [clojure.data.csv :as csv]
[clojure.java.io :as io]))
(defn func [x]
(rand-nth [true false]))
(defn proc-line [line writer]
(if (func line)
(do (.write writer (str line "\n")) :new)
(do (.write writer (str (reverse line) "\n")) :old)))
(defn generate-report [from to]
(with-open
[reader (io/reader from)
writer (io/writer to)]
(->> (csv/read-csv reader)
(rest)
(map #(proc-line % writer))
(frequencies)
(doall))))
I'd try to separate data processing from side-effects like reading/writing files. Hopefully this would allow the IO operations to stay at opposite boundaries of the pipeline, and the "middle" processing logic is agnostic of where the input comes from and where the output is going.
(defn rand-bool [] (rand-nth [true false]))
(defn proc-line [line]
(if (rand-bool)
[line :new]
[(reverse line) :old]))
proc-line no longer takes a writer, it only cares about the line and it returns a vector/2-tuple of the processed line along with a keyword. It doesn't concern itself with string formatting either—we should let csv/write-csv do that. Now you could do something like this:
(defn process-lines [reader]
(->> (csv/read-csv reader)
(rest)
(map proc-line)))
(defn generate-report [from to]
(with-open [reader (io/reader from)
writer (io/writer to)]
(let [lines (process-lines reader)]
(csv/write-csv writer (map first lines))
(frequencies (map second lines)))))
This will work but it's going to realize/keep the entire input sequence in memory, which you don't want for large files. We need a way to keep this pipeline lazy/efficient, but we also have to produce two "streams" from one and in a single pass: the processed lines only to be sent to write-csv, and each line's metadata for calculating frequencies. One "easy" way to do this is to introduce some mutability to track the metadata frequencies as the lazy sequence is consumed by write-csv:
(defn generate-report [from to]
(with-open [reader (io/reader from)
writer (io/writer to)]
(let [freqs (atom {})]
(->> (csv/read-csv reader)
;; processing starts
(rest)
(map (fn [line]
(let [[row tag] (proc-line line)]
(swap! freqs update tag (fnil inc 0))
row)))
;; processing ends
(csv/write-csv writer))
@freqs)))
I removed the process-lines call to make the full pipeline more apparent. By the time write-csv has fully (and lazily) consumed its payload, freqs will be a map like {:old 23, :new 31} which will be the return value of generate-report. There's room for improvement/generalization, but I think this is a start.
As others have mentioned, separating writing and processing work would be ideal. Here's how I usually do this:
(defn product-type [p]
(rand-nth [:new :old]))
(defn row->product [row]
(let [p (zipmap [:id :name :price] row)]
(assoc p :type (product-type p))))
(defmulti to-csv :type)
(defmethod to-csv :new [product] ...)
(defmethod to-csv :old [product] ...)
(defn generate-report [from to]
(with-open [rdr (io/reader from)
wrtr (io/writer to)]
(->> (rest (csv/read-csv rdr))
(map row->product)
(map #(do (.write wrtr (to-csv %)) %))
(map :type)
(frequencies)
(doall))))
(The code might not work—didn't run it, sorry.)
Constructing a hash-map and using multimethods is optional, of course, but it's better to assign a product its type first. This way its data dictates what pipeline is doing, not proc-line.
To refactor the code we need the safety net of at least one characterization test for generate-report. Since that function does file I/O (we will make the code independent from I/O later), we will use this sample CSV file, f1.csv:
Year,Code
1997,A
2000,B
2010,C
1996,D
2001,E
We cannot yet write a test because function func uses a RNG, so we rewrite it to be deterministic by actually looking at the input. While there, we rename it to new?, which is more representative of the problem:
(defn new? [row]
(>= (Integer/parseInt (first row)) 2000))
where, for the sake of the exercise, we assume that a row is "new" if the Year column is >= 2000.
We can now write the test and see it pass (here for brevity we focus only on the frequency calculation, not on the output transformation):
(deftest characterization-as-posted
(is (= {:old 2, :new 3}
(generate-report "f1.csv" "f1.tmp"))))
And now to the refactoring. The main idea is to realize that we need an accumulator, replacing map with reduce and getting rid of frequencies and of doall. Also, we rename "line" with "row", since this is how a line is called in the CSV format:
(defn generate-report [from to] ; 1
(let [[old new _] ; 2
(with-open [reader (io/reader from) ; 3
writer (io/writer to)] ; 4
(->> (csv/read-csv reader) ; 5
(rest) ; 6
(reduce process-row [0 0 writer])))] ; 7
{:old old :new new})) ; 8
The new process-row (originally proc-line) becomes:
(defn process-row [[old new writer] row]
(if (new? row)
(do (.write writer (str row "\n")) [old (inc new) writer])
(do (.write writer (str (reverse row) "\n")) [(inc old) new writer])))
Function process-row, as any function to be passed to reduce, has two arguments: first argument [old new writer] is a vector of two accumulators and of the I/O writer (the vector is destructured); second argument row is one element of the collection that is being reduced. It returns the new vector of accumulators, that at the end of the collection is destructured in line 2 of generate-report and used at line 8 to create a hashmap equivalent to the one previously returned by frequencies.
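The accumulator idea can be tried in isolation, without any I/O. The sketch below is a stripped-down variant, not the code from this answer: `count-row` is a hypothetical helper that carries only the two counters, and `rows` mirrors the data rows of f1.csv.

```clojure
(defn new? [row]
  (>= (Integer/parseInt (first row)) 2000))

(defn count-row
  "Reducing function: takes the accumulator [old new] and one row,
  and returns the updated accumulator."
  [[old new] row]
  (if (new? row)
    [old (inc new)]
    [(inc old) new]))

;; Data rows of f1.csv, header already dropped.
(def rows [["1997" "A"] ["2000" "B"] ["2010" "C"] ["1996" "D"] ["2001" "E"]])

(def summary
  (let [[old new] (reduce count-row [0 0] rows)]
    {:old old :new new}))

(println summary) ; {:old 2, :new 3}
```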
We can do one last refactoring: separate the file I/O from the business logic, so that we can write tests without the scaffolding of prepared input files, as follows.
Function process-row becomes:
(defn process-row [[old-cnt new-cnt writer] row]
(let [[out-row old new] (process-row-pure old-cnt new-cnt row)]
(do (.write writer out-row)
[old new writer])))
and the business logic can be done by the pure (and so easily testable) function:
(defn process-row-pure [old new row]
(if (new? row)
[(str row "\n") old (inc new)]
[(str (reverse row) "\n") (inc old) new]))
All this without mutating anything.
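As an illustration of the testability claim, a test for the pure function might look like the sketch below. The test name and the sample rows are made up, and only the counters are checked, since the output transformation is still the placeholder from the question.

```clojure
(require '[clojure.test :refer [deftest is run-tests]])

(defn new? [row]
  (>= (Integer/parseInt (first row)) 2000))

(defn process-row-pure [old new row]
  (if (new? row)
    [(str row "\n") old (inc new)]
    [(str (reverse row) "\n") (inc old) new]))

(deftest process-row-pure-updates-counters
  ;; A row from 2001 counts as new.
  (let [[_ old new] (process-row-pure 0 0 ["2001" "E"])]
    (is (= [0 1] [old new])))
  ;; A row from 1997 counts as old.
  (let [[_ old new] (process-row-pure 0 0 ["1997" "A"])]
    (is (= [1 0] [old new]))))

(run-tests)
```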
IMHO, I would separate the two different aspects: counting the frequencies and writing to a file:
(defn count-lines
  ([lines] (count-lines lines 0 0))
  ([lines count-old count-new]
   (if-let [line (first lines)]
     ;; recur arguments must follow the parameter order: lines first
     (if (func line)
       (recur (rest lines) count-old (inc count-new))
       (recur (rest lines) (inc count-old) count-new))
     {:new count-new :old count-old})))
(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (let [lines (rest (csv/read-csv reader))
          freqs (count-lines lines)]
      (doseq [line lines]
        (.write writer (str line "\n")))
      ;; return the summary instead of the nil that doseq yields
      freqs)))

Find out where the error happened in Clojure

For the most part I understand what Clojure is telling me with its error messages. But I am still clueless as to how to find out where the error happened.
Here is an example of what I mean
(defn extract [m]
(keys m))
(defn multiple [xs]
(map #(* 2 %) xs))
(defn process [xs]
(-> xs
(multiple) ; seq -> seq
(extract))) ; map -> seq ... fails
(process [1 2 3])
Statically typed languages would now tell me that I tried to pass a sequence to a function that expects a map on line X. And Clojure does this in a way:
ClassCastException java.lang.Long cannot be cast to java.util.Map$Entry
But I still have no idea where the error happened. Obviously for this instance it's easy because there are just 3 functions involved, you can easily just read through all of them but as programs grow bigger this gets old very quickly.
Is there a way to find out where the errors happened other than just proofreading the code from top to bottom (which is my current approach)?
You can use clojure.spec. It is still in alpha, and there's still a bunch of tooling support coming (hopefully), but instrumenting functions works well.
(ns foo.core
(:require
;; For clojure 1.9.0-alpha16 and higher, it is called spec.alpha
[clojure.spec.alpha :as s]
[clojure.spec.test.alpha :as stest]))
;; Extract takes a map and returns a seq
(s/fdef extract
:args (s/cat :m map?)
:ret seq?)
(defn extract [m]
(keys m))
;; multiple takes a coll of numbers and returns a coll of numbers
(s/fdef multiple
:args (s/cat :xs (s/coll-of number?))
:ret (s/coll-of number?))
(defn multiple [xs]
(map #(* 2 %) xs))
(defn process [xs]
(-> xs
(multiple) ; seq -> seq
(extract))) ; map -> seq ... fails
;; This needs to come after the definition of the specs,
;; but before the call to process.
;; This is something I imagine can be handled automatically
;; by tooling at some point.
(stest/instrument)
;; The println is to force evaluation.
;; If not it wouldn't run because it's lazy and
;; not used for anything.
(println (process [1 2 3]))
Running this file prints (among other info):
Call to #'foo.core/extract did not conform to spec: In: [0] val: (2
4 6) fails at: [:args :m] predicate: map? :clojure.spec.alpha/spec
#object[clojure.spec.alpha$regex_spec_impl$reify__1200 0x2b935f0d
"clojure.spec.alpha$regex_spec_impl$reify__1200@2b935f0d"]
:clojure.spec.alpha/value ((2 4 6)) :clojure.spec.alpha/args ((2 4
6)) :clojure.spec.alpha/failure :instrument
:clojure.spec.test.alpha/caller {:file "core.clj", :line 29,
:var-scope foo.core/process}
Which can be read as: A call to extract failed because the value passed in, (2 4 6), failed the predicate map?. That call happened in the file "core.clj" at line 29.
A caveat that trips people up is that instrument only checks function arguments and not return values. This is a (strange if you ask me) design decision from Rich Hickey. There's a library for that, though.
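The caveat can be seen in a short sketch, assuming Clojure 1.9 or later so that clojure.spec.alpha is available. `double-it` is a made-up function that deliberately violates its own :ret spec; once instrumented, the bad return value passes through unchecked while a bad argument is rejected.

```clojure
(require '[clojure.spec.alpha :as s]
         '[clojure.spec.test.alpha :as stest])

;; Hypothetical function that returns a string although its spec promises a number.
(defn double-it [x]
  (str (* 2 x)))

(s/fdef double-it
  :args (s/cat :x number?)
  :ret number?)

(stest/instrument `double-it)

;; The :ret spec is not enforced by instrument, so this call succeeds.
(def ret-unchecked (double-it 2))

;; The :args spec is enforced, so this call throws before the body runs.
(def args-checked
  (try
    (double-it "a")
    (catch clojure.lang.ExceptionInfo _
      :args-rejected)))

(println ret-unchecked args-checked)
```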
If you have a REPL session you can print a stack trace:
(clojure.stacktrace/print-stack-trace *e 30)
See http://puredanger.github.io/tech.puredanger.com/2010/02/17/clojure-stack-trace-repl/ for various different ways of printing the stack trace. You will need to have a dependency such as this in your project.clj:
[org.clojure/tools.namespace "0.2.11"]
I didn't get a stack trace using the above method, however just typing *e at the REPL will give you all the available information about the error, which to be honest didn't seem very helpful.
For the rare cases where the stack trace is not helpful I usually debug using a call to a function that returns the single argument it is given, yet has the side effect of printing that argument. I happen to call this function probe. In your case it can be put at multiple places in the threading macro.
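A minimal sketch of such a probe function is shown below. The exact definition is an assumption, since only its behavior is described above, and `count` merely stands in for whatever step follows in the pipeline.

```clojure
(defn probe
  "Prints x and returns it unchanged, so it can sit anywhere in a -> pipeline."
  [x]
  (println "probe:" x)
  x)

(defn multiply [xs]
  (mapv #(* 2 %) xs))

;; The value flowing between the two steps is printed as a side effect.
(def result
  (-> [1 2 3]
      (multiply)
      (probe)      ; prints: probe: [2 4 6]
      (count)))
```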
Re-typing your example I have:
(defn extract [m]
(keys m))
(defn multiply [xs]
(mapv #(* 2 %) xs))
(defn process [xs]
(-> xs
(multiply) ; seq -> seq
(extract))) ; map -> seq ... fails ***line 21***
(println (process [1 2 3]))
;=> java.lang.ClassCastException: java.lang.Long cannot be cast
to java.util.Map$Entry, compiling:(tst/clj/core.clj:21:21)
So we get a good clue in the exception, where it says the file and line/column number tst/clj/core.clj:21:21, that the extract function is the problem.
Another indispensable tool I use is Plumatic Schema to inject "gradual" type checking into Clojure. The code becomes:
(ns tst.clj.core
(:use clj.core tupelo.test)
(:require
[tupelo.core :as t]
[tupelo.schema :as tsk]
[schema.core :as s]))
(t/refer-tupelo)
(t/print-versions)
(s/defn extract :- [s/Any]
[m :- tsk/Map]
(keys m))
(s/defn multiply :- [s/Num]
[xs :- [s/Num]]
(mapv #(* 2 %) xs))
(s/defn process :- s/Any
[xs :- [s/Num]]
(-> xs
(multiply) ; seq -> seq
(extract))) ; map -> seq ... fails
(println (process [1 2 3]))
clojure.lang.ExceptionInfo: Input to extract does not match schema:
[(named (not (map? [2 4 6])) m)] {:type :schema.core/error, :schema [#schema.core.One{:schema {Any Any},
:optional? false, :name m}],
:value [[2 4 6]], :error [(named (not (map? [2 4 6])) m)]},
compiling:(tst/clj/core.clj:23:17)
So, while the format of the error message is a bit lengthy, it tells right away that we passed a parameter of the wrong type and/or shape into the method extract.
Note that you need a line like this:
(s/set-fn-validation! true) ; enforce fn schemas
I create a special file test/tst/clj/_bootstrap.clj so it is always in the same place.
For more information on Plumatic Schema please see:
https://github.com/plumatic/schema
https://youtu.be/o_jtwIs2Ot8
https://github.com/plumatic/schema/wiki/Basics-Examples
https://github.com/plumatic/schema/wiki/Defining-New-Schema-Types-1.0

How to print each elements of a hash map list using map function in clojure?

I am constructing a list of hash maps which is then passed to another function. When I try to print each hash maps from the list using map it is not working. I am able to print the full list or get the first element etc.
(defn m [a]
(println a)
(map #(println %) a))
The following works from the repl only.
(m (map #(hash-map :a %) [1 2 3]))
But from the program that I load using load-file it is not working. I am seeing the a but not its individual elements. What's wrong?
In Clojure, transformation functions such as map return a lazy sequence. So (map #(println %) a) returns a lazy sequence. Only when that sequence is consumed is the mapped function applied, and only then is the printing side effect visible.
If the purpose of the function is to have a side effect, like printing, you need to eagerly evaluate the transformation. The functions dorun and doall do this:
(def a [1 2 3])
(dorun (map #(println %) a))
; returns nil
(doall (map #(println %) a))
; returns the collection
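The delay can be observed directly with a counter, as in the sketch below (`calls`, `noisy`, and `pending` are made-up names). It also suggests why the code appears to work at the REPL: the REPL prints the returned sequence, and printing consumes it, whereas a value that is simply discarded is never consumed.

```clojure
;; Counts how many times the mapped function has actually run.
(def calls (atom 0))

(defn noisy [x]
  (swap! calls inc)
  x)

;; Building the lazy sequence runs nothing yet.
(def pending (map noisy [1 2 3]))
(def calls-before @calls)

;; Forcing the sequence runs the side effects.
(dorun pending)
(def calls-after @calls)

(println calls-before calls-after) ; 0 3
```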
If you actually don't want to map, but only have a side effect, you can use doseq. It is intended to 'iterate' to do side effects:
(def a [1 2 3])
(doseq [i a]
(println i))
If your goal is simply to call an existing function on every item in a collection in order, ignoring the returned values, then you should use run!:
(run! println [1 2 3])
;; 1
;; 2
;; 3
;;=> nil
In some more complicated cases it may be preferable to use doseq as @Gamlor suggests, but in this case, doseq only adds boilerplate.
I recommend using tail recursion. Note that Clojure does not optimize self-calls automatically, so use recur in the tail position:
(defn printList [a]
  (let [head (first a)
        tail (rest a)]
    (when (not (nil? head))
      (println head)
      (recur tail))))

clojure.core/map isn't working

I am trying to figure out why one of my map calls isn't working. I am building a crawler with the purpose of learning Clojure.
(use '[clojure.java.io])
(defn md5
"Generate a md5 checksum for the given string"
[token]
(let [hash-bytes
(doto (java.security.MessageDigest/getInstance "MD5")
(.reset)
(.update (.getBytes token)))]
(.toString
(new java.math.BigInteger 1 (.digest hash-bytes)) ; Positive and the size of the number
16)))
(defn full-url [url base]
(if (re-find #"^http[s]{0,1}://" url)
url
(apply str "http://" base (if (= \/ (first url))
url
(apply str "/" url)))))
(defn get-domain-from-url [url]
(let [matcher (re-matcher #"http[s]{0,1}://([^/]*)/{0,1}" url)
domain-match (re-find matcher)]
(nth domain-match 1)))
(defn crawl [url]
(do
(println "-----------------------------------\n")
(if (.exists (clojure.java.io/as-file (apply str "theinternet/page" (md5 url))))
(println (apply str url " already crawled ... skiping \n"))
(let [domain (get-domain-from-url url)
text (slurp url)
matcher (re-matcher #"<a[^>]*href\s*=\s*[\"\']([^\"\']*)[\"\'][^>]*>(.*)</a\s*>" text)]
(do
(spit (apply str "theinternet/page" (md5 url)) text)
(loop [urls []
a-tag (re-find matcher)]
(if a-tag
(let [u (nth a-tag 1)]
(recur (conj urls (full-url u domain)) (re-find matcher)))
(do
(println (apply str "parsed: " url))
(println (apply str (map (fn [u]
(apply str "-----> " u "\n")) urls)))
(map crawl urls)))))))))
(defn -main
"I don't do a whole lot ... yet."
[& args]
(crawl "http://www.example.com/"))
First call to map works:
(println (apply str (map (fn [u]
(apply str "-----> " u "\n")) urls)))
But the second call (map crawl urls) seems to be ignored.
The crawl function works as intended: it slurps the URL, uses the regex to extract the href of each a tag, and accumulates the results in the loop. But when I call map with crawl and the URLs that have been found on the page, the call to map is ignored.
Also if I try to call (map crawl ["http://www.example.com"]) this call is, again, ignored.
I have started my Clojure adventure a couple of weeks ago so any suggestions/criticisms are most welcomed.
Thank you
In Clojure, map is lazy. From the docs, map:
Returns a lazy sequence consisting of the result of applying f to the
set of first items of each coll, followed by applying f to the set
of second items in each coll, until any one of the colls is
exhausted.
Your crawl function is a function with side effects - you're spit-ing some results to a file, and println-ing to report on progress. But, because map returns a lazy sequence, none of these things will happen - the result sequence is never explicitly realized so it can stay lazy.
There are a number of ways of realizing a lazy sequence (that has been created e.g. using map), but in this case, as you want to iterate over a sequence using a function that has side-effects, it's probably best to use doseq:
Repeatedly executes body (presumably for side-effects) with
bindings and filtering as provided by "for". Does not retain
the head of the sequence. Returns nil.
If you replace the call to (map crawl urls) with (doseq [u urls] (crawl u)), you should get the desired result.
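To show the effect without touching the network, here is a toy sketch in which a hypothetical in-memory map `links` plays the role of the pages and `visited` records each call. The recursive calls made through doseq all run, in order.

```clojure
;; Hypothetical "web": each page maps to the links found on it.
(def links {"a" ["b" "c"]
            "b" []
            "c" []})

;; Records every URL that crawl was actually called with.
(def visited (atom []))

(defn crawl [url]
  (swap! visited conj url)
  ;; doseq is eager, so each recursive call runs immediately.
  (doseq [u (links url)]
    (crawl u)))

(crawl "a")
(println @visited) ; [a b c]
```

Replacing the doseq with (map crawl (links url)) in this sketch would leave only "a" recorded, because nothing consumes the resulting lazy sequence.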
Note: your first call to map works as expected because you are realizing the results using (apply str). There is no way to (apply str) without evaluating the sequence.

clojure read large text file and count occurrences

I'm trying to read a large text file and count occurrences of specific errors.
For example, for the following sample text
something
bla
error123
foo
test
error123
line
junk
error55
more
stuff
I want to end up with (don't really care what data structure although I am thinking a map)
error123 - 2
error55 - 1
Here is what I have tried so far
(require '[clojure.java.io :as io])
(defn find-error [line]
(if (re-find #"error" line)
line))
(defn read-big-file [func, filename]
(with-open [rdr (io/reader filename)]
(doall (map func (line-seq rdr)))))
calling it like this
(read-big-file find-error "sample.txt")
returns:
(nil nil "error123" nil nil "error123" nil nil "error55" nil nil)
Next I tried to remove the nil values and group like items
(group-by identity (remove #(= nil %) (read-big-file find-error "sample.txt")))
which returns
{"error123" ["error123" "error123"], "error55" ["error55"]}
This is getting close to the desired output, although it may not be efficient. How can I get the counts now? Also, as someone new to Clojure and functional programming, I would appreciate any suggestions on how I might improve this.
thanks!
I think you might be looking for the frequencies function:
user=> (doc frequencies)
-------------------------
clojure.core/frequencies
([coll])
Returns a map from distinct items in coll to the number of times
they appear.
nil
So, this should give you what you want:
(frequencies (remove nil? (read-big-file find-error "sample.txt")))
;;=> {"error123" 2, "error55" 1}
If your text file is really large, however, I would recommend doing this on the line-seq inline to ensure you don't run out of memory. This way you can also use a filter rather than map and remove.
(defn count-lines [pred, filename]
(with-open [rdr (io/reader filename)]
(frequencies (filter pred (line-seq rdr)))))
(defn is-error-line? [line]
(re-find #"error" line))
(count-lines is-error-line? "sample.txt")
;; => {"error123" 2, "error55" 1}
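If the error code can appear inside a longer line, a variation is to count the matched token instead of the whole line. The sketch below makes that assumption: the pattern `error\d+` and the name `count-errors` are made up for the example, and it reads the sample text from a StringReader so that it runs without a file.

```clojure
(import '(java.io BufferedReader StringReader))

(defn count-errors
  "Counts occurrences of each error token found in the lines of rdr."
  [^BufferedReader rdr]
  ;; keep drops the nils that re-find returns for non-matching lines;
  ;; frequencies is eager, so the lines are consumed while rdr is open.
  (frequencies (keep #(re-find #"error\d+" %) (line-seq rdr))))

(def sample
  "something\nbla\nerror123\nfoo\ntest\nerror123\nline\njunk\nerror55\nmore\nstuff\n")

(def counts
  (with-open [rdr (BufferedReader. (StringReader. sample))]
    (count-errors rdr)))

(println counts) ; {error123 2, error55 1}
```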