Consuming file contents with Clojure's core.async

I'm trying to use Clojure's core.async library to consume/process lines from a file. When my code executes, an IOException: Stream closed is thrown. Below is a REPL session that reproduces the problem:
(require '[clojure.core.async :as async])
(require '[clojure.java.io :as io])
; my real code is a bit more involved with calls to drop, map, filter
; following line-seq
(def lines
  (with-open [reader (io/reader "my-file.txt")]
    (line-seq reader)))
(def ch
  (let [c (async/chan)]
    (async/go
      (doseq [ln lines]
        (async/>! c ln))
      (async/close! c))
    c))
; line that causes the error
; java.io.IOException: Stream closed
(async/<!! ch)
Since this is my first time doing something like this (async + file), I may have some misconceptions about how it should work. Can someone clarify the correct approach for sending file lines into a channel pipeline?
Thanks!

As @Alan pointed out, your definition of lines closes the file without reading all of its lines, because line-seq returns a lazy sequence. If you expand your use of the with-open macro...
(macroexpand-1
  '(with-open [reader (io/reader "my-file.txt")]
     (line-seq reader)))
... you get this:
(clojure.core/let [reader (io/reader "my-file.txt")]
  (try
    (clojure.core/with-open []
      (line-seq reader))
    (finally
      (. reader clojure.core/close))))
You can fix this problem by closing the file after you finish reading from it, rather than immediately:
(def ch
  (let [c (async/chan)]
    (async/go
      (with-open [reader (io/reader "my-file.txt")]
        (doseq [ln (line-seq reader)]
          (async/>! c ln)))
      (async/close! c))
    c))
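Reading a file is blocking IO, and go blocks run on a small fixed-size thread pool, so a variant that does the read on a dedicated thread may be preferable. The following is a minimal sketch of that idea, assuming core.async is on the classpath; the name lines-chan is hypothetical and not part of the original code.

```clojure
(require '[clojure.core.async :as async]
         '[clojure.java.io :as io])

(defn lines-chan
  "Hypothetical helper: returns a channel that receives each line of source
  (anything io/reader accepts) and is closed after the last line."
  [source]
  (let [c (async/chan)]
    ;; async/thread runs the body on a separate thread, so the blocking
    ;; file reads and the blocking put (>!!) do not occupy a go-block thread.
    (async/thread
      (try
        (with-open [reader (io/reader source)]
          (loop [lines (line-seq reader)]
            ;; >!! returns false once the channel has been closed,
            ;; which lets the reading thread stop early.
            (when (and (seq lines) (async/>!! c (first lines)))
              (recur (rest lines)))))
        (finally
          (async/close! c))))
    c))
```

The reader stays open for as long as lines are being put on the channel, and the channel is closed in the finally clause even if reading fails.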

Your problem is the with-open statement. The file is closed as soon as this scope is exited. Because line-seq is lazy, the file is closed before the lines are actually read.
For most files (those that fit comfortably in memory), you will be better off using the slurp function:
(require '[clojure.string :as str])
(def file-as-str (slurp "my-file.txt"))
(def lines (str/split-lines file-as-str))
See:
http://clojuredocs.org/clojure.core/slurp
http://clojuredocs.org/clojure.string/split-lines
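The scoping problem described in these answers can be reproduced without a file. The sketch below uses an in-memory reader (the helper string-reader is hypothetical) to show that a lazy line-seq escaping with-open fails once it is consumed, while forcing it with doall inside the scope works.

```clojure
(defn string-reader
  "Hypothetical helper: a BufferedReader over an in-memory string."
  [s]
  (java.io.BufferedReader. (java.io.StringReader. s)))

;; Lazy: the seq leaves with-open unrealized. line-seq reads only the
;; first line up front; the rest would be read after the reader is closed.
(def escaped-lines
  (with-open [rdr (string-reader "a\nb\nc")]
    (line-seq rdr)))

;; Eager: doall reads every line while the reader is still open.
(def realized-lines
  (with-open [rdr (string-reader "a\nb\nc")]
    (doall (line-seq rdr))))

(first escaped-lines)   ; "a", already read before the close
;; (doall escaped-lines) would throw java.io.IOException: Stream closed
realized-lines          ; ("a" "b" "c")
```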

Related

Understanding core.async merge, in Clojure vs ClojureScript

I'm experimenting with core.async on Clojure and ClojureScript, to try and understand how merge works. In particular, whether merge makes any values put on input channels available to take immediately on the merged channel.
I have the following code:
(ns async-merge-example.core
(:require
#?(:clj [clojure.core.async :as async] :cljs [cljs.core.async :as async])
[async-merge-example.exec :as exec]))
(defn async-fn-timeout
[v]
(async/go
(async/<! (async/timeout (rand-int 5000)))
v))
(defn async-fn-exec
[v]
(exec/exec "sh" "-c" (str "sleep " (rand-int 5) "; echo " v ";")))
(defn merge-and-print-results
[seq async-fn]
(let [chans (async/merge (map async-fn seq))]
(async/go
(while (when-let [v (async/<! chans)]
(prn v)
v)))))
When I try async-fn-timeout with a large-ish seq:
(merge-and-print-results (range 20) async-fn-timeout)
For both Clojure and ClojureScript I get the result I expect, as in, results start getting printed pretty much immediately, with the expected delays.
However, when I try async-fn-exec with the same seq:
(merge-and-print-results (range 20) async-fn-exec)
For ClojureScript, I get the result I expect: results start getting printed pretty much immediately, with the expected delays. However, for Clojure, even though the sh processes are executed concurrently (subject to the size of the core.async thread pool), the results appear to be initially delayed, then mostly printed all at once! I can make this difference more obvious by increasing the size of the seq, e.g. (range 40).
Since the results for async-fn-timeout are as expected on both Clojure and ClojureScript, the finger points at the differences between the Clojure and ClojureScript implementations of exec.
But I don't know why this difference would cause the issue.
Notes:
These observations were made in WSL on Windows 10
The source code for async-merge-example.exec is below
In exec, the implementation differs for Clojure and ClojureScript due to differences between Clojure/Java and ClojureScript/NodeJS.
(ns async-merge-example.exec
(:require
#?(:clj [clojure.core.async :as async] :cljs [cljs.core.async :as async])))
; cljs implementation based on https://gist.github.com/frankhenderson/d60471e64faec9e2158c
; clj implementation based on https://stackoverflow.com/questions/45292625/how-to-perform-non-blocking-reading-stdout-from-a-subprocess-in-clojure
#?(:cljs (def spawn (.-spawn (js/require "child_process"))))
#?(:cljs
(defn exec-chan
"spawns a child process for cmd with args. routes stdout, stderr, and
the exit code to a channel. returns the channel immediately."
[cmd args]
(let [c (async/chan), p (spawn cmd (if args (clj->js args) (clj->js [])))]
(.on (.-stdout p) "data" #(async/put! c [:out (str %)]))
(.on (.-stderr p) "data" #(async/put! c [:err (str %)]))
(.on p "close" #(async/put! c [:exit (str %)]))
c)))
#?(:clj
(defn exec-chan
"spawns a child process for cmd with args. routes stdout, stderr, and
the exit code to a channel. returns the channel immediately."
[cmd args]
(let [c (async/chan)]
(async/go
(let [builder (ProcessBuilder. (into-array String (cons cmd (map str args))))
process (.start builder)]
(with-open [reader (clojure.java.io/reader (.getInputStream process))
err-reader (clojure.java.io/reader (.getErrorStream process))]
(loop []
(let [line (.readLine ^java.io.BufferedReader reader)
err (.readLine ^java.io.BufferedReader err-reader)]
(if (or line err)
(do (when line (async/>! c [:out line]))
(when err (async/>! c [:err err]))
(recur))
(do
(.waitFor process)
(async/>! c [:exit (.exitValue process)]))))))))
c)))
(defn exec
"executes cmd with args. returns a channel immediately which
will eventually receive a result map of
{:out [stdout-lines] :err [stderr-lines] :exit [exit-code]}"
[cmd & args]
(let [c (exec-chan cmd args)]
(async/go (loop [output (async/<! c) result {}]
(if (= :exit (first output))
(assoc result :exit (second output))
(recur (async/<! c) (update result (first output) #(conj (or % []) (second output)))))))))
Your Clojure implementation uses blocking IO in a single thread. You read first from stdout and then from stderr in a loop. Both reads are blocking readLine calls, so each returns only once it has read a complete line or the stream has ended. So unless your process writes the same number of lines to stdout and stderr, one stream will end up blocking the other.
Once the process has finished, readLine no longer blocks and simply returns nil once the buffer is empty. The loop then reads the remaining buffered output and completes, which explains the "all at once" messages.
You'll probably want to start a second thread that deals with reading from stderr.
Node does not do blocking IO, so everything happens asynchronously by default and one stream doesn't block the other.
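One way to follow the suggestion of reading stderr on its own thread is sketched below. It is not the original implementation: the names pump-lines and exec-chan-threaded are hypothetical, and the sketch assumes core.async is on the classpath and that the command is available on the system.

```clojure
(require '[clojure.core.async :as async]
         '[clojure.java.io :as io])

(defn pump-lines
  "Hypothetical helper: reads lines from input-stream on a dedicated thread
  and puts [tag line] on c. Returns a channel that closes when the stream
  is exhausted."
  [input-stream tag c]
  (async/thread
    (with-open [rdr (io/reader input-stream)]
      (doseq [line (line-seq rdr)]
        (async/>!! c [tag line])))))

(defn exec-chan-threaded
  "Sketch of a Clojure exec-chan in which stdout and stderr are read
  independently, so a quiet stream cannot hold up a busy one."
  [cmd args]
  (let [c        (async/chan)
        builder  (ProcessBuilder. ^java.util.List (mapv str (cons cmd args)))
        process  (.start builder)
        out-done (pump-lines (.getInputStream process) :out c)
        err-done (pump-lines (.getErrorStream process) :err c)]
    (async/thread
      ;; Wait until both streams are drained, then report the exit code.
      (async/<!! out-done)
      (async/<!! err-done)
      (async/>!! c [:exit (.waitFor process)])
      (async/close! c))
    c))
```

Because [:exit code] is still the last message on the channel, a consumer that accumulates messages until it sees :exit, as exec does, should work with this channel as well.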

counting lines in a file with a filter with clojure

I'm trying to figure out what is wrong with my code here. Basically the idea behind it is that I am reading a very large file and at the end of each line in the file is a number. I want to count the number of lines that have the number at the end greater than 500.
What I have is this. On paper it should work, but something is going wrong and it keeps returning nil.
(defn countlines []
  (with-open [rdr (clojure.java.io/reader "myfile.txt")]
    (doseq [line (line-seq rdr)]
      (count (re-find #"(?!500)[56789]\d{2,}|\d{4,}$" line)))))
The reason is that you use doseq:
clojure.core/doseq
[seq-exprs & body]
Macro
Added in 1.0
Repeatedly executes body (presumably for side-effects) with
bindings and filtering as provided by "for". Does not retain
the head of the sequence. Returns nil.
You could rewrite it to something like (doall (for [line (line-seq rdr)] ...)), but that still would not accomplish your task, because the function would then return a seq of the character counts of the matches:
user> (count (re-find #"\d+" "123k456"))
3
which is obviously not what you want. What you need to do is:
(count (filter #(re-find #"(?!500)[56789]\d{2,}|\d{4,}$" %)
               (line-seq rdr)))
If I understand the question correctly, you should be doing something like this:
(defn countlines [] (with-open [rdr (clojure.java.io/reader "myfile.txt")]
(-> (line-seq rdr)
(filter #(re-find #"(?!500)[56789]\d{2,}|\d{4,}$" %))
(count))))
Regarding Martin Lechner's answer, I think it should use thread-last (->>) rather than thread-first (->), because filter and count take the sequence as their last argument. So it should be:
(defn countlines []
  (with-open [rdr (clojure.java.io/reader "myfile.txt")]
    (->> (line-seq rdr)
         (filter #(re-find #"(?!500)[56789]\d{2,}|\d{4,}$" %))
         (count))))
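As an alternative to encoding "greater than 500" in a regular expression, the trailing number can be parsed and compared numerically. This is only a sketch under the assumption that each relevant line ends with a non-negative integer small enough for Long/parseLong; the function names are hypothetical.

```clojure
(require '[clojure.java.io :as io])

(defn trailing-number
  "Returns the integer at the end of line as a long, or nil if there is none."
  [line]
  (when-let [[_ digits] (re-find #"(\d+)\s*$" line)]
    (Long/parseLong digits)))

(defn count-lines-over
  "Counts the lines whose trailing number is greater than threshold."
  [threshold lines]
  (count (filter #(some-> (trailing-number %) (> threshold)) lines)))

(defn count-lines-in-file
  "Applies count-lines-over to a file; count realizes the lazy seq
  while the reader is still open."
  [threshold filename]
  (with-open [rdr (io/reader filename)]
    (count-lines-over threshold (line-seq rdr))))

(count-lines-over 500 ["a 501" "b 500" "c 1000" "no number" "d 42"])
;; => 2
```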

How to read n lines from a file in clojure

I want to read the first n lines from a file using Clojure. Here is my code:
(defn read-nth-line [file]
  (with-open [rdr (reader file)]
    (loop [line-number 0]
      (when (< line-number 20)
        (nth (line-seq rdr) line-number)
        (recur (inc line-number))))))
but when I run
user=> (read-nth-line "test.txt")
IndexOutOfBoundsException clojure.lang.RT.nthFrom (RT.java:871)
I have no idea why I got such an error.
Your code produces an out-of-bounds error because you call line-seq multiple times on the same reader. If you want to get a number of lines from a reader, you should call line-seq only once, then take the desired number of lines from that sequence:
(require '[clojure.java.io :as io])
(defn lines [n filename]
  (with-open [rdr (io/reader filename)]
    (doall (take n (line-seq rdr)))))
Example:
(run! println (lines 20 "test.txt"))
If test.txt contains fewer than 20 lines, this will simply print all the lines in the file.
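To see why repeated line-seq calls on one reader go wrong, the sketch below uses an in-memory reader (the helper names are hypothetical). Each call to line-seq starts a new sequence at the reader's current position, so lines consumed by an earlier call are gone and indexes no longer line up.

```clojure
(defn string-reader
  "Hypothetical helper: a BufferedReader over an in-memory string."
  [s]
  (java.io.BufferedReader. (java.io.StringReader. s)))

(defn nth-with-fresh-seqs
  "Mimics the question's loop for two iterations: a new line-seq per lookup."
  [text]
  (with-open [rdr (string-reader text)]
    [(nth (line-seq rdr) 0) (nth (line-seq rdr) 1)]))

(defn first-lines
  "Calls line-seq once and takes n lines from that single sequence."
  [n text]
  (with-open [rdr (string-reader text)]
    (doall (take n (line-seq rdr)))))

(nth-with-fresh-seqs "a\nb\nc\nd") ; => ["a" "c"], line "b" is skipped
(first-lines 2 "a\nb\nc\nd")       ; => ("a" "b")
```

With a few more iterations the fresh sequences run out of remaining lines before the requested index, which is where the IndexOutOfBoundsException comes from.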

How can I deserialize a record structure from a file previously saved with print-dup?

I have the following code:
(use 'clojure.java.io)
(defrecord Member [id name salary role])
(defrecord Role [id name])
(def member-records (ref ()))
(defn add-member [member]
(dosync (alter member-records conj member)))
;;Test-data -->
(def dev-r(->Role 1 "Developer"))
(def test-member1(->Member 1 "Kirill" 70000.00 dev-r))
;;Test-data <--
(defn save-data-2-file []
  (with-open [wrtr (writer "C:/Platform/Work/test.cdf")]
    (print-dup @member-records wrtr)))
(defn process-line [line]
(println line))
;;Test line content
;;#BTC.pcost.Member{:id 1, :name "Kirill", :salary 70000.0, :role #BTC.pcost.Role{:id 1, :name "Developer"}})
(defn load-data-from-file []
(with-open [rdr (reader "C:/Platform/Work/test.cdf")]
(doseq [line (line-seq rdr)]
(process-line line))))
I want to recreate the records after reading the file, but I can't figure out how to do it. Yes, I know I could parse the text and fill my structures from the parsed elements of each line, but that would be difficult because I have a lot of structures like "Member" and "Role". Can anyone suggest a way to do this?
You can use read-string, and slurp, to pull the records out of the file. read-string is limited to reading the first form of a string, but, from your sample, you are only storing a single form, as a list of records.
(defn load-data-from-file [file]
(read-string (slurp file)))
Lazy Reading
If you need more than the first form, or cannot read the entire stream into memory, you can use read directly, to make a lazy reader.
(defn lazy-read
([rdr] (let [eof (Object.)] (lazy-read rdr (read rdr false eof) eof)))
([rdr data eof]
(if (not= eof data)
(cons data (lazy-seq (lazy-read rdr (read rdr false eof) eof))))))
(defn load-all-data [file]
(with-open [rdr (java.io.PushbackReader. (reader file))]
(doall (lazy-read rdr))))
(load-all-data "C:/Platform/Work/test.cdf")
Security
Security is also worth mentioning when loading data with read-string or read. You should only use them with trusted sources because, using #= or a Java constructor, the source can execute arbitrary code inside your application. For a longer explanation, take a look at the documentation for read.
Setting *read-eval* to false would prevent the issue, but it would also prevent the reconstruction of the records in your sample. To avoid the issue altogether, you can use the clojure.edn/read and clojure.edn/read-string functions, with a whitelist of readers.
(require 'clojure.edn)
(defn edn-read [eof rdr]
  (clojure.edn/read {:eof eof
                     :readers {'BTC.pcost.Role map->Role
                               'BTC.pcost.Member map->Member}}
                    rdr))
(defn lazy-edn-read
  ([rdr] (let [eof (Object.)] (lazy-edn-read rdr (edn-read eof rdr) eof)))
  ([rdr data eof]
   (if (not= eof data)
     (cons data (lazy-seq (lazy-edn-read rdr (edn-read eof rdr) eof))))))
(defn load-all-data [file]
  (with-open [rdr (java.io.PushbackReader. (reader file))]
    (doall (take-while (complement nil?) (lazy-edn-read rdr)))))
(load-all-data "C:/Platform/Work/test.cdf")
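The whitelist-of-readers idea can be tried in memory before involving any file. The sketch below is an illustration rather than the original poster's setup: it writes the records with pr-str instead of print-dup, and it derives the tag symbols from the record class names so that it works in whichever namespace it is evaluated. The names record-readers, records->str and str->records are hypothetical.

```clojure
(require '[clojure.edn :as edn])

(defrecord Role [id name])
(defrecord Member [id name salary role])

;; Only these two tags are accepted; any other tagged literal in the
;; input makes the EDN reader throw instead of constructing an object.
(def record-readers
  {(symbol (.getName Role))   map->Role
   (symbol (.getName Member)) map->Member})

(defn records->str
  "Serializes records with pr-str, which prints them as #ns.Type{...}."
  [records]
  (pr-str records))

(defn str->records
  "Reads the string back as EDN, rebuilding records via record-readers."
  [s]
  (edn/read-string {:readers record-readers} s))

(def members [(->Member 1 "Kirill" 70000.0 (->Role 1 "Developer"))])

(= members (str->records (records->str members)))
;; => true
```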
You can use read.
This function will read one object from a file:
(defn load-data-from-file [filename]
(with-open [rdr (java.io.PushbackReader. (reader filename))]
(read rdr)))
Or this will read all objects from the file:
(defn load-all-data-from-file [filename]
(let [eof (Object.)]
(with-open [rdr (java.io.PushbackReader. (reader filename))]
(doall
(take-while #(not= % eof)
(repeatedly #(read rdr nil eof)))))))
Here's the API documentation for read.
This is a small variation that will read all objects from a string:
(defn load-all-data-from-string [string]
(let [eof (Object.)]
(with-open [rdr (-> string java.io.StringReader. java.io.PushbackReader.)]
(doall
(take-while #(not= % eof)
(repeatedly #(read rdr nil eof)))))))
This is, as far as I know, not possible to do using read-string. Instead we use read with a java.io.StringReader.

Clojure: buffered reader in for loop

I have a large text file I want to process in Clojure.
I need to process it 2 lines at a time.
I settled on using a for loop so I could pull 2 lines for each pass with the following binding (rdr is my reader):
[[line-a line-b] (partition 2 (line-seq rdr))]
(I would be interested in knowing other ways to get 2 lines for each loop iteration but that is not the point of my question).
When trying to get the loop to work (using a simpler binding for these tests), I am seeing the following behavior that I can't explain:
Why does
(with-open [rdr (reader "path/to/file")]
(for [line (line-seq rdr)]
line))
trigger a Stream closed exception
while
(with-open [rdr (reader "path/to/file")]
(doseq [line (line-seq rdr)]
(println line)))
works?
for is lazy and just returns the head of a sequence that will read the data from the file only when it is consumed. By the time the REPL prints the for's contents, the file is already closed. You can fix this by wrapping the for in a doall:
(with-open [rdr (reader "path/to/file")]
(doall (for [line (line-seq rdr)]
line)))
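Though this makes the sequence eager.
Here is a sample function from my misc.clj that stays lazy and closes the file once its end is reached (note that the file is closed only if the sequence is consumed all the way to the end):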
(defn byte-seq
  "create a lazy seq of bytes in a file and close the file at the end"
  [rdr]
  (let [result (. rdr read)]
    (if (= result -1)
      (do (. rdr close) nil)
      (lazy-seq (cons result (byte-seq rdr))))))
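Returning to the question's goal of handling two lines per iteration, the doall approach combines with partition as in the following sketch. It takes any java.io.Reader so that it can be tried with an in-memory string; the name read-line-pairs is hypothetical.

```clojure
(defn read-line-pairs
  "Reads all lines from source (a java.io.Reader) and returns them as
  [line-a line-b] pairs. doall forces the lazy for while the reader is open."
  [source]
  (with-open [rdr (java.io.BufferedReader. source)]
    (doall
      (for [[line-a line-b] (partition 2 (line-seq rdr))]
        [line-a line-b]))))

(read-line-pairs (java.io.StringReader. "a\nb\nc\nd\ne"))
;; => (["a" "b"] ["c" "d"])
```

partition drops a trailing incomplete group, so the unpaired last line "e" is not returned; partition-all would keep it as a one-element group. When only side effects are needed, doseq over the same partition avoids holding the results in memory.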