How to read n lines from a file in Clojure

I want to read the first n lines from a file using Clojure. Here is my code:
(defn read-nth-line [file]
  (with-open [rdr (reader file)]
    (loop [line-number 0]
      (when (< line-number 20)
        (nth (line-seq rdr) line-number)
        (recur (inc line-number))))))
But when I run
user=> (read-nth-line "test.txt")
IndexOutOfBoundsException clojure.lang.RT.nthFrom (RT.java:871)
I have no idea why I got such an error.

Your code produces an out-of-bounds error because you call line-seq multiple times on the same reader. If you want to get a number of lines from a reader, you should call line-seq only once, then take the desired number of lines from that sequence:
(require '[clojure.java.io :as io])
(defn lines [n filename]
  (with-open [rdr (io/reader filename)]
    (doall (take n (line-seq rdr)))))
Example:
(run! println (lines 20 "test.txt"))
If test.txt contains fewer than 20 lines, this will simply print all the lines in the file.
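To see why repeated calls to line-seq are a problem, here is a minimal sketch that uses an in-memory reader instead of a file. The function name first-line-twice is made up for illustration. Each call to line-seq reads a fresh line from the same reader, so the second call does not start again from the top:

```clojure
(import '(java.io BufferedReader StringReader))

;; Hypothetical helper: calls line-seq twice on the same reader and
;; returns the first element of each resulting sequence.
(defn first-line-twice [text]
  (with-open [rdr (BufferedReader. (StringReader. text))]
    [(first (line-seq rdr))
     (first (line-seq rdr))]))

;; The first call consumes "a", so the second call starts at "b".
(first-line-twice "a\nb\nc\n")
```

In the original loop, every (nth (line-seq rdr) line-number) consumes more lines from the reader, so the file runs out long before line-number reaches 20.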

Related

Consuming file contents with Clojure's core.async

I'm trying to use Clojure's core.async library to consume/process lines from a file. When my code executes, an IOException: Stream closed is thrown. Below is a REPL session that reproduces the same problem as in my code:
(require '[clojure.core.async :as async])
(require '[clojure.java.io :as io])
; my real code is a bit more involved with calls to drop, map, filter
; following line-seq
(def lines
  (with-open [reader (io/reader "my-file.txt")]
    (line-seq reader)))
(def ch
  (let [c (async/chan)]
    (async/go
      (doseq [ln lines]
        (async/>! c ln))
      (async/close! c))
    c))
; line that causes the error
; java.io.IOException: Stream closed
(async/<!! ch)
Since this is my first time doing something like this (async + file), maybe I have some misconceptions about how it should work. Can someone clarify the correct approach for sending file lines into a channel pipeline?
Thanks!
As @Alan pointed out, your definition of lines closes the file without reading all of its lines, because line-seq returns a lazy sequence. If you expand your use of the with-open macro...
(macroexpand-1
'(with-open [reader (io/reader "my-file.txt")]
(line-seq reader)))
... you get this:
(clojure.core/let [reader (io/reader "my-file.txt")]
  (try
    (clojure.core/with-open []
      (line-seq reader))
    (finally
      (. reader clojure.core/close))))
You can fix this problem by closing the file after you finish reading from it, rather than immediately:
(def ch
  (let [c (async/chan)]
    (async/go
      (with-open [reader (io/reader "my-file.txt")]
        (doseq [ln (line-seq reader)]
          (async/>! c ln)))
      (async/close! c))
    c))
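The lazy-sequence behaviour can be reproduced without core.async or a real file. The following is a minimal sketch under that assumption: text-reader, lazy-lines and eager-lines are hypothetical names, and an in-memory reader stands in for (io/reader "my-file.txt").

```clojure
(import '(java.io BufferedReader StringReader))

;; Hypothetical helper: an in-memory reader standing in for a file reader.
(defn text-reader [text]
  (BufferedReader. (StringReader. text)))

;; Returns the lazy seq unrealized; the reader is closed before the
;; remaining lines are read.
(defn lazy-lines [text]
  (with-open [rdr (text-reader text)]
    (line-seq rdr)))

;; Realizes every line while the reader is still open.
(defn eager-lines [text]
  (with-open [rdr (text-reader text)]
    (doall (line-seq rdr))))

(eager-lines "a\nb\nc") ; all lines are read before the reader closes

(try
  ;; Reading past the first line touches the already closed reader.
  (doall (lazy-lines "a\nb\nc"))
  (catch Exception e
    (println "failed:" (.getMessage e))))
```

The only difference between the two functions is whether the sequence is realized before with-open closes the reader.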
Your problem is the with-open statement. The file is closed as soon as this scope is exited. So, you open a line-seq and then close the file before reading any lines.
For files that fit comfortably in memory, you will be better off using the slurp function, which reads the whole file at once:
(require '[clojure.string :as str])
(def file-as-str (slurp "my-file.txt"))
(def lines (str/split-lines file-as-str))
See:
http://clojuredocs.org/clojure.core/slurp
http://clojuredocs.org/clojure.string/split-lines

Counting lines in a file with a filter with Clojure

I'm trying to figure out what is wrong with my code here. Basically the idea behind it is that I am reading a very large file and at the end of each line in the file is a number. I want to count the number of lines that have the number at the end greater than 500.
What I have is this and on paper it should work, but something is going wrong and I keep returning nil.
(defn countlines []
  (with-open [rdr (clojure.java.io/reader "myfile.txt")]
    (doseq [line (line-seq rdr)]
      (count (re-find #"(?!500)[56789]\d{2,}|\d{4,}$" line)))))
The reason is that you use doseq, which always returns nil:
clojure.core/doseq
[seq-exprs & body]
Macro
Added in 1.0
Repeatedly executes body (presumably for side-effects) with
bindings and filtering as provided by "for". Does not retain
the head of the sequence. Returns nil.
You could rewrite it to something like (doall (for [line (line-seq rdr)] ...)), but that still would not do what you want: your function would return a seq of the character counts of the matches, because count on a matched string returns its length:
user> (count (re-find #"\d+" "123k456"))
3
That is obviously not what you want. What you need to do is count the lines that match:
(count (filter #(re-find #"(?!500)[56789]\d{2,}|\d{4,}$" %)
               (line-seq rdr)))
If I understand the question correctly, you should be doing something like this:
(defn countlines []
  (with-open [rdr (clojure.java.io/reader "myfile.txt")]
    (-> (line-seq rdr)
        (filter #(re-find #"(?!500)[56789]\d{2,}|\d{4,}$" %))
        (count))))
Regarding Martin Lechner's answer, I think it should use thread-last (->>) rather than thread-first (->), because filter takes the sequence as its last argument. So it should be:
(defn countlines []
  (with-open [rdr (clojure.java.io/reader "myfile.txt")]
    (->> (line-seq rdr)
         (filter #(re-find #"(?!500)[56789]\d{2,}|\d{4,}$" %))
         (count))))
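As an alternative to encoding "greater than 500" in a regular expression, the trailing number can be parsed and compared numerically. This is only a sketch: trailing-number and count-lines-over are hypothetical names, the function takes an already open reader, and it assumes each number is a plain run of digits at the very end of the line that fits in a long.

```clojure
(import '(java.io BufferedReader StringReader))

;; Hypothetical helper: returns the number at the end of the line,
;; or nil when the line does not end in digits.
(defn trailing-number [line]
  (when-let [digits (re-find #"\d+$" line)]
    (Long/parseLong digits)))

;; Counts the lines whose trailing number is greater than threshold.
;; count realizes the sequence, so call this while rdr is still open.
(defn count-lines-over [threshold rdr]
  (->> (line-seq rdr)
       (keep trailing-number)
       (filter #(> % threshold))
       count))

(with-open [rdr (BufferedReader. (StringReader. "a 499\nb 500\nc 501\nd 1200"))]
  (println (count-lines-over 500 rdr)))
```

For a real file, the reader would come from (clojure.java.io/reader "myfile.txt") as in the answers above.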

Read file until certain line in Clojure using doseq

This would normally be trivial in other languages, but I've found no such example in Clojure.
I can println an entire file using:
(with-open [rdr (io/reader "file")]
(doseq [line (line-seq rdr) :while (< count(line) 10)]
(println line)))
But how do I get it to stop at line 5?
Thanks.
You can try this:
(println
  (with-open [rdr (clojure.java.io/reader "file")]
    (let [ls (line-seq rdr)]
      (doall (take 5 ls)))))
This will print the first 5 lines of the specified file.
If you need to skip lines that do not satisfy a condition, you can add filter. The following code will print the first five lines whose length is less than 10.
(println
  (with-open [rdr (clojure.java.io/reader "file")]
    (let [ls (line-seq rdr)]
      (->> ls
           (filter #(< (count %) 10))
           (take 5)
           (doall)))))
Since filter and take return lazy sequences, the result must be realized inside the with-open form. Outside the with-open form the reader is already closed, so realizing the sequence throws an exception.
The println function also realizes the sequence, so you can modify the code like this:
(with-open [rdr (clojure.java.io/reader "data/base_exp.txt")]
  (let [ls (line-seq rdr)]
    (->> ls
         (filter #(> (count %) 10))
         (take 5)
         (println))))
Simply use take to limit the number of lines:
Replace
(doseq [line (line-seq rdr) ;; ...
with
(doseq [line (take 5 (line-seq rdr)) ;; ...
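If the stopping point depends on the content of a line rather than on its position, take-while can play the same role as take. A minimal sketch, where lines-until is a hypothetical helper and the "END" marker is just an example value:

```clojure
(import '(java.io BufferedReader StringReader))

;; Hypothetical helper: collects lines until stop? returns true for a line.
;; doall realizes the result while the reader is still open.
(defn lines-until [stop? rdr]
  (doall (take-while (complement stop?) (line-seq rdr))))

(with-open [rdr (BufferedReader. (StringReader. "one\ntwo\nEND\nthree"))]
  (run! println (lines-until #(= "END" %) rdr)))
```

The line that triggers the stop is not included in the result.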

Why does this Clojure code run out of memory?

I have a twenty million line, sorted text file. It has lots of duplicate lines. I have some Clojure code that figures out how many instances there are of each unique line, i.e. the output is something like:
alpha 20
beta 17
gamma 3
delta 4
...
The code works for smaller files, but on this larger one, it runs out of memory. What am I doing wrong? I assume that somewhere I am holding on to the head.
(require '[clojure.java.io :as io])
(def bi-grams (line-seq (io/reader "the-big-input-file.txt")))
(defn quick-process [input-list filename]
  (with-open [out (io/writer filename)] ;; e.g. "train/2gram-freq.txt"
    (binding [*out* out]
      (dorun (map (fn [[w v]] (println w "\t" (count v)))
                  (partition-by identity input-list))))))
(quick-process bi-grams "output.txt")
Your bi-grams variable is holding on to the head of the line-seq.
Try (quick-process (line-seq (io/reader "the-big-input-file.txt")) "output.txt").
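One way to avoid a top-level var altogether is to open the reader and the writer in the same function and walk the groups with doseq, which does not retain the head. The following is a sketch under those assumptions; run-counts and write-run-counts are hypothetical names, and each output line is the unique line, a tab, and the length of its run.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper: lazily turns a sorted seq of lines into
;; [line run-length] pairs, one pair per run of identical lines.
(defn run-counts [lines]
  (map (juxt first count) (partition-by identity lines)))

;; Streams the counts from in-file to out-file. No var or local keeps a
;; reference to the start of the line-seq, so consumed lines can be
;; garbage collected.
(defn write-run-counts [in-file out-file]
  (with-open [rdr (io/reader in-file)
              out (io/writer out-file)]
    (doseq [[line n] (run-counts (line-seq rdr))]
      (.write ^java.io.Writer out ^String (str line "\t" n "\n")))))
```

Usage would look like (write-run-counts "the-big-input-file.txt" "output.txt"). Opening the reader with with-open also closes it, which the original def never does.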

Clojure: buffered reader in for loop

I have a large text file I want to process in Clojure.
I need to process it 2 lines at a time.
I settled on using a for loop so I could pull 2 lines for each pass with the following binding (rdr is my reader):
[[line-a line-b] (partition 2 (line-seq rdr))]
(I would be interested in knowing other ways to get 2 lines for each loop iteration but that is not the point of my question).
When trying to get the loop to work (using a simpler binding for these tests), I am seeing the following behavior that I can't explain:
Why does
(with-open [rdr (reader "path/to/file")]
(for [line (line-seq rdr)]
line))
trigger a Stream closed exception
while
(with-open [rdr (reader "path/to/file")]
(doseq [line (line-seq rdr)]
(println line)))
works?
for is lazy and just returns the head of a sequence that will read the data from the file only when it is consumed. The file is already closed by the time the for's contents are printed by your REPL. You can fix this by wrapping the for in a doall:
(with-open [rdr (reader "path/to/file")]
  (doall (for [line (line-seq rdr)]
           line)))
Note that this makes the sequence eager, so the whole file is read into memory.
Here is a sample function from my misc.clj that lazily closes the file at its end:
(defn byte-seq
  "Create a lazy seq of bytes in a file and close the file at the end."
  [rdr]
  (let [result (.read rdr)]
    (if (= result -1)
      ;; end of input: close the reader and terminate the sequence
      (do (.close rdr) nil)
      (lazy-seq (cons result (byte-seq rdr))))))
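Coming back to the two-lines-at-a-time part of the question, the same rule applies: the pairs have to be realized (or processed with doseq) while the reader is open. A minimal sketch, with line-pairs as a hypothetical helper and an in-memory reader standing in for the file:

```clojure
(import '(java.io BufferedReader StringReader))

;; Hypothetical helper: eagerly reads the lines in pairs.
;; mapv realizes everything, so the result is safe to use after the
;; reader is closed. partition drops a trailing unpaired line;
;; partition-all would keep it.
(defn line-pairs [rdr]
  (mapv vec (partition 2 (line-seq rdr))))

(with-open [rdr (BufferedReader. (StringReader. "a\nb\nc\nd\ne"))]
  (doseq [[line-a line-b] (line-pairs rdr)]
    (println line-a line-b)))
```

For a file too large to hold in memory, the doseq can iterate over (partition 2 (line-seq rdr)) directly inside the with-open instead of building the vector first.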