Clojure error - GC overhead limit exceeded

I'm trying to randomly sample a large FASTQ file and write it to standard out. I keep getting 'GC overhead limit exceeded' errors and I'm not sure what I'm doing wrong. I've tried increasing Xmx in leiningen but that didn't help. Here is my code:
(ns fastq-sample.core
  (:gen-class)
  (:use clojure.java.io))

(def n-read-pair-lines 8)

(defn sample? [sample-rate]
  (> sample-rate (rand)))

;
; Agent for writing the reads asynchronously
;
(def wtr (agent (writer *out*)))

(defn write-out [r]
  (letfn [(write [out msg] (.write out msg) out)]
    (send wtr write r)))

(defn write-close []
  (send wtr #(.close %))
  (await wtr))

;
; Main
;
(defn reads [file]
  (->>
    (input-stream file)
    (java.util.zip.GZIPInputStream.)
    (reader)
    (line-seq)))

(defn -main [fastq-file sample-rate-str]
  (let [sample-rate (Float. sample-rate-str)
        in-reads (partition n-read-pair-lines (reads fastq-file))]
    (doseq [x (filter (fn [_] (sample? sample-rate)) in-reads)]
      (write-out (clojure.string/join "\n" x)))
    (write-close)
    (shutdown-agents)))

This is the same symptom I often get when I try to merge an infinite sequence into a single data structure like a map or vector. It very often means that memory was tight and the garbage collector could not keep up with the demand for new objects. Most likely the wtr agent is too large for memory. Perhaps you want to avoid storing the printed results in the agent by changing
(write [out msg] (.write out msg) out)
to
(write [out msg] (.write out msg))
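For reference, Leiningen reads JVM heap settings from project.clj via :jvm-opts; a minimal sketch (the project coordinates and the 4g figure are placeholders, not from the question):

(defproject fastq-sample "0.1.0-SNAPSHOT"
  :main fastq-sample.core
  :jvm-opts ["-Xmx4g"])

That said, a bigger heap only postpones this error if the amount of live data keeps growing.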

Related

Understanding core.async merge, in Clojure vs ClojureScript

I'm experimenting with core.async on Clojure and ClojureScript, to try and understand how merge works. In particular, whether merge makes any values put on input channels available to take immediately on the merged channel.
I have the following code:
(ns async-merge-example.core
  (:require
   #?(:clj [clojure.core.async :as async] :cljs [cljs.core.async :as async])
   [async-merge-example.exec :as exec]))

(defn async-fn-timeout
  [v]
  (async/go
    (async/<! (async/timeout (rand-int 5000)))
    v))

(defn async-fn-exec
  [v]
  (exec/exec "sh" "-c" (str "sleep " (rand-int 5) "; echo " v ";")))

(defn merge-and-print-results
  [seq async-fn]
  (let [chans (async/merge (map async-fn seq))]
    (async/go
      (while (when-let [v (async/<! chans)]
               (prn v)
               v)))))
When I try async-fn-timeout with a large-ish seq:
(merge-and-print-results (range 20) async-fn-timeout)
For both Clojure and ClojureScript I get the result I expect, as in, results start getting printed pretty much immediately, with the expected delays.
However, when I try async-fn-exec with the same seq:
(merge-and-print-results (range 20) async-fn-exec)
For ClojureScript, I get the result I expect, as in, results start getting printed pretty much immediately, with the expected delays. However, for Clojure, even though the sh processes are executed concurrently (subject to the size of the core.async thread pool), the results appear to be initially delayed, then mostly printed all at once! I can make this difference more obvious by increasing the size of the seq, e.g. (range 40).
Since the results for async-fn-timeout are as expected on both Clojure and ClojureScript, the finger is pointed at the differences between the Clojure and ClojureScript implementations of exec.
But I don't know why this difference would cause this issue?
Notes:
These observations were made in WSL on Windows 10
The source code for async-merge-example.exec is below
In exec, the implementation differs for Clojure and ClojureScript due to differences between Clojure/Java and ClojureScript/NodeJS.
(ns async-merge-example.exec
  (:require
   #?(:clj [clojure.core.async :as async] :cljs [cljs.core.async :as async])))

; cljs implementation based on https://gist.github.com/frankhenderson/d60471e64faec9e2158c
; clj implementation based on https://stackoverflow.com/questions/45292625/how-to-perform-non-blocking-reading-stdout-from-a-subprocess-in-clojure

#?(:cljs (def spawn (.-spawn (js/require "child_process"))))

#?(:cljs
   (defn exec-chan
     "spawns a child process for cmd with args. routes stdout, stderr, and
      the exit code to a channel. returns the channel immediately."
     [cmd args]
     (let [c (async/chan), p (spawn cmd (if args (clj->js args) (clj->js [])))]
       (.on (.-stdout p) "data" #(async/put! c [:out (str %)]))
       (.on (.-stderr p) "data" #(async/put! c [:err (str %)]))
       (.on p "close" #(async/put! c [:exit (str %)]))
       c)))

#?(:clj
   (defn exec-chan
     "spawns a child process for cmd with args. routes stdout, stderr, and
      the exit code to a channel. returns the channel immediately."
     [cmd args]
     (let [c (async/chan)]
       (async/go
         (let [builder (ProcessBuilder. (into-array String (cons cmd (map str args))))
               process (.start builder)]
           (with-open [reader (clojure.java.io/reader (.getInputStream process))
                       err-reader (clojure.java.io/reader (.getErrorStream process))]
             (loop []
               (let [line (.readLine ^java.io.BufferedReader reader)
                     err (.readLine ^java.io.BufferedReader err-reader)]
                 (if (or line err)
                   (do (when line (async/>! c [:out line]))
                       (when err (async/>! c [:err err]))
                       (recur))
                   (do
                     (.waitFor process)
                     (async/>! c [:exit (.exitValue process)]))))))))
       c)))

(defn exec
  "executes cmd with args. returns a channel immediately which
   will eventually receive a result map of
   {:out [stdout-lines] :err [stderr-lines] :exit [exit-code]}"
  [cmd & args]
  (let [c (exec-chan cmd args)]
    (async/go (loop [output (async/<! c) result {}]
                (if (= :exit (first output))
                  (assoc result :exit (second output))
                  (recur (async/<! c)
                         (update result (first output) #(conj (or % []) (second output)))))))))
Your Clojure implementation uses blocking IO in a single thread. You first read from stdout and then from stderr in a loop. Both do a blocking readLine, so each will only return once it has actually finished reading a line. So unless your process produces the same amount of output on stdout and stderr, one stream will end up blocking the other one.
Once the process is finished, readLine will no longer block and will just return nil once the buffer is empty. So the loop just finishes reading the buffered output and then finally completes, which explains the "all at once" messages.
You'll probably want to start a second thread that deals with reading from stderr.
node does not do blocking IO, so everything happens async by default and one stream doesn't block the other.
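A minimal sketch of that two-thread approach, assuming the same channel shape as the exec-chan above: each stream is pumped on its own thread so neither blocks the other, and :exit is emitted only after both streams are drained.

#?(:clj
   (defn exec-chan
     "Like the version above, but reads stdout and stderr on separate
      threads so one stream cannot block the other."
     [cmd args]
     (let [c (async/chan)
           process (.start (ProcessBuilder. (into-array String (cons cmd (map str args)))))
           pump (fn [stream tag]
                  ;; async/thread returns a channel that closes when the body finishes
                  (async/thread
                    (with-open [rdr (clojure.java.io/reader stream)]
                      (doseq [line (line-seq rdr)]
                        (async/>!! c [tag line])))))
           out-done (pump (.getInputStream process) :out)
           err-done (pump (.getErrorStream process) :err)]
       (async/go
         (async/<! out-done)                          ; park until stdout is drained
         (async/<! err-done)                          ; park until stderr is drained
         (.waitFor process)                           ; returns quickly: both streams hit EOF
         (async/>! c [:exit (.exitValue process)]))
       c)))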

clojure java.lang.NullPointerException while splitting string

I am new to Clojure. I am trying to write a program which reads data from a file (a comma-separated file). After reading the data, I am trying to split each line on the delimiter "," but I am facing the below error:
CompilerException java.lang.NullPointerException,
compiling:(com\clojure\apps\StudentRanks.clj:26:5)
Here is my code:
(ns com.clojure.apps.StudentRanks)

(require '[clojure.string :as str])

(defn student []
  (def dataset (atom []))
  (def myList (atom ()))
  (def studObj (atom ()))
  (with-open [rdr (clojure.java.io/reader "e:\\example.txt")]
    (swap! dataset into (reduce conj [] (line-seq rdr))))
  (println @dataset)
  (def studentCount (count @dataset))
  (def ind (atom 0))
  (loop [n studentCount]
    (when (>= n 0)
      (swap! myList conj (get @dataset n))
      (println (get @dataset n))
      (recur (dec n))))
  (println myList)
  (def scount (count @dataset))
  (loop [m scount]
    (when (>= m 0)
      (def data (get @dataset m))
      (println (str/split data #","))
      (recur (dec m)))))

(student)
Thanks in advance.
As pointed out in the comments, the first problem is that you are not writing correct Clojure.
To start, def should never be nested -- it's not going to behave like you hope. Use let to introduce local variables (usually just called locals because it's weird to call variables things that don't vary).
Second, block-like constructs (such as do, let, or with-open) evaluate to the value of their last expression.
So this snippet
(def dataset (atom []))
(with-open [rdr (clojure.java.io/reader "e:\\example.txt")]
  (swap! dataset into (reduce conj [] (line-seq rdr))))

should be written

(let [dataset
      (with-open [rdr (clojure.java.io/reader "e:\\example.txt")]
        (into [] (line-seq rdr)))]
  ; code using dataset goes here
  )
Then you try to convert dataset (a vector) to a list (myList) by traversing it backwards and consing on the list under construction. It's not needed. You can get a sequence (list-like) out of a vector by just calling seq on it. (Or rseq if you want the list to be reversed.)
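For example:

(seq [1 2 3])  ; => (1 2 3)
(rseq [1 2 3]) ; => (3 2 1)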
Last, you iterate once again to split and print each item held in dataset. Explicit iteration with indices is pretty unusual in Clojure; prefer reduce, doseq, into, etc.
Here are two ways to write student:
(defn student [] ; just for print
  (with-open [rdr (clojure.java.io/reader "e:\\example.txt")]
    (doseq [data (line-seq rdr)]
      (println (str/split data #",")))))

(defn student [] ; to return a value
  (with-open [rdr (clojure.java.io/reader "e:\\example.txt")]
    (into []
          (for [data (line-seq rdr)]
            (str/split data #",")))))
I hope this will help you get a better feel for Clojure.
I suggest you use a csv library:
(require '[clojure.data.csv :as csv])
(csv/read-csv (slurp "example.txt"))
Unless this is some file io exercise.

Using proper functional style in a file processing task

I have an input csv file and need to generate an output file that has one line for each input line. Each input line could be of a specific type (say "old" or "new") that can be determined only by processing the input line.
In addition to generating the output file, we also want to print a summary of how many lines of each type were in the input file. My actual task involves generating different SQLs based on the input line type, but to keep the example code focused, I have kept the processing in the function proc-line simple. The function func determines what type an input line is -- again, I have kept it simple by randomly generating a type. The actual logic is more involved.
I have the following code and it does the job. However, to retain a functional style for the task of generating the summary, I chose to return a keyword to signify the type of each line and created a lazy sequence of these for generating the final summary. In an imperative style, we would simply increment a count for each line type. Generating a potentially large collection just for summarizing seems inefficient. Another consequence of the way I have coded it is the repetition of the (.write writer ...) portion. Ideally, I would code that just once.
Any suggestions for eliminating the two problems I have identified (and others)?
(ns file-proc.core
  (:gen-class)
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]))

(defn func [x]
  (rand-nth [true false]))

(defn proc-line [line writer]
  (if (func line)
    (do (.write writer (str line "\n")) :new)
    (do (.write writer (str (reverse line) "\n")) :old)))

(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (->> (csv/read-csv reader)
         (rest)
         (map #(proc-line % writer))
         (frequencies)
         (doall))))
I'd try to separate data processing from side-effects like reading/writing files. Hopefully this would allow the IO operations to stay at opposite boundaries of the pipeline, and the "middle" processing logic is agnostic of where the input comes from and where the output is going.
(defn rand-bool [] (rand-nth [true false]))

(defn proc-line [line]
  (if (rand-bool)
    [line :new]
    [(reverse line) :old]))
proc-line no longer takes a writer, it only cares about the line and it returns a vector/2-tuple of the processed line along with a keyword. It doesn't concern itself with string formatting either—we should let csv/write-csv do that. Now you could do something like this:
(defn process-lines [reader]
  (->> (csv/read-csv reader)
       (rest)
       (map proc-line)))

(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (let [lines (process-lines reader)]
      (csv/write-csv writer (map first lines))
      (frequencies (map second lines)))))
This will work but it's going to realize/keep the entire input sequence in memory, which you don't want for large files. We need a way to keep this pipeline lazy/efficient, but we also have to produce two "streams" from one and in a single pass: the processed lines only to be sent to write-csv, and each line's metadata for calculating frequencies. One "easy" way to do this is to introduce some mutability to track the metadata frequencies as the lazy sequence is consumed by write-csv:
(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (let [freqs (atom {})]
      (->> (csv/read-csv reader)
           ;; processing starts
           (rest)
           (map (fn [line]
                  (let [[row tag] (proc-line line)]
                    (swap! freqs update tag (fnil inc 0))
                    row)))
           ;; processing ends
           (csv/write-csv writer))
      @freqs)))
I removed the process-lines call to make the full pipeline more apparent. By the time write-csv has fully (and lazily) consumed its payload, freqs will be a map like {:old 23, :new 31} which will be the return value of generate-report. There's room for improvement/generalization, but I think this is a start.
As others have mentioned, separating writing and processing work would be ideal. Here's how I usually do this:
(defn product-type [p]
(rand-nth [:new :old]))
(defn row->product [row]
(let [p (zipmap [:id :name :price] row)]
(assoc p :type (product-type p))))
(defmulti to-csv :type)
(defmethod to-csv :new [product] ...)
(defmethod to-csv :old [product] ...)
(defn generate-report [from to]
(with-open [rdr (io/reader from)
wrtr (io/writer to)]
(->> (rest (csv/read-csv rdr))
(map row->product)
(map #(do (.write wrtr (to-csv %)) %))
(map :type)
(frequencies)
(doall))))
(The code might not work—didn't run it, sorry.)
Constructing a hash-map and using multimethods is optional, of course, but it's better to assign a product its type first. This way the data dictates what the pipeline is doing, not proc-line.
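For illustration only, hypothetical to-csv bodies (not part of the original answer) might mirror the question's reverse-the-row behavior for old products:

(defmethod to-csv :new [product]
  (str (clojure.string/join "," ((juxt :id :name :price) product)) "\n"))

(defmethod to-csv :old [product]
  (str (clojure.string/join "," (reverse ((juxt :id :name :price) product))) "\n"))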
To refactor the code we need the safety net of at least one characterization test for generate-report. Since that function does file I/O (we will make the code independent from I/O later), we will use this sample CSV file, f1.csv:
Year,Code
1997,A
2000,B
2010,C
1996,D
2001,E
We cannot yet write a test because the function func uses an RNG, so we rewrite it to be deterministic by actually looking at the input. While there, we rename it to new?, which is more representative of the problem:
(defn new? [row]
  (>= (Integer/parseInt (first row)) 2000))
where, for the sake of the exercise, we assume that a row is "new" if the Year column is >= 2000.
We can now write the test and see it pass (here for brevity we focus only on the frequency calculation, not on the output transformation):
(deftest characterization-as-posted
  (is (= {:old 2, :new 3}
         (generate-report "f1.csv" "f1.tmp"))))
And now to the refactoring. The main idea is to realize that we need an accumulator, replacing map with reduce and getting rid of frequencies and of doall. Also, we rename "line" to "row", since this is what a line is called in the CSV format:
(defn generate-report [from to]                         ; 1
  (let [[old new _]                                     ; 2
        (with-open [reader (io/reader from)             ; 3
                    writer (io/writer to)]              ; 4
          (->> (csv/read-csv reader)                    ; 5
               (rest)                                   ; 6
               (reduce process-row [0 0 writer])))]     ; 7
    {:old old :new new}))                               ; 8
The new process-row (originally process-line) becomes:
(defn process-row [[old new writer] row]
  (if (new? row)
    (do (.write writer (str row "\n")) [old (inc new) writer])
    (do (.write writer (str (reverse row) "\n")) [(inc old) new writer])))
Function process-row, like any function to be passed to reduce, has two arguments: the first argument, [old new writer], is a vector of the two accumulators and the I/O writer (the vector is destructured); the second argument, row, is one element of the collection being reduced. It returns the new vector of accumulators, which at the end of the collection is destructured in line 2 of generate-report and used at line 8 to create a hash map equivalent to the one previously returned by frequencies.
We can do one last refactoring: separate the file I/O from the business logic, so that we can write tests without the scaffolding of prepared input files, as follows.
Function process-row becomes:
(defn process-row [[old-cnt new-cnt writer] row]
  (let [[out-row old new] (process-row-pure old-cnt new-cnt row)]
    (do (.write writer out-row)
        [old new writer])))
and the business logic can be done by the pure (and so easily testable) function:
(defn process-row-pure [old new row]
  (if (new? row)
    [(str row "\n") old (inc new)]
    [(str (reverse row) "\n") (inc old) new]))
All this without mutating anything.
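As a quick sketch of the kind of test this enables (assuming the new? predicate from earlier; the expected strings are built with the same str calls the implementation uses):

(deftest process-row-pure-test
  ;; Year >= 2000: row written as-is, :new counter incremented
  (is (= [(str ["2001" "E"] "\n") 0 1]
         (process-row-pure 0 0 ["2001" "E"])))
  ;; Year < 2000: row reversed, :old counter incremented
  (is (= [(str (reverse ["1997" "A"]) "\n") 1 0]
         (process-row-pure 0 0 ["1997" "A"]))))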
IMHO, I would separate the two different aspects: counting the frequencies and writing to a file:
(defn count-lines
  ([lines] (count-lines lines 0 0))
  ([lines count-old count-new]
   (if-let [line (first lines)]
     (if (func line)
       (recur (rest lines) count-old (inc count-new))
       (recur (rest lines) (inc count-old) count-new))
     {:new count-new :old count-old})))

(defn generate-report [from to]
  (with-open [reader (io/reader from)
              writer (io/writer to)]
    (let [lines (rest (csv/read-csv reader))
          frequencies (count-lines lines)]
      (doseq [line lines]
        (.write writer (str line "\n")))
      frequencies)))

Agent/actor like constructs in clojure that operate on all messages received since last update

What's best way in clojure to implement something like an actor or agent (asynchronously updated, uncoordinated reference) that does the following?
gets sent messages/data
executes some function on that data to obtain new state; something like (fn [state new-msgs] ...)
continues to receive messages/data during that update
once done with that update, runs the same update function against all messages that have been sent in the interim
An agent doesn't seem quite right here. One must simultaneously send function and data to agents, which doesn't leave room for a function which operates on all data that has come in during the last update. The goal implicitly requires a decoupling of function and data.
The actor model seems generally better suited in that there is a decoupling of function and data. However, all actor frameworks I'm aware of seem to assume each message sent will be processed separately. It's not clear how one would turn this on its head without adding extra machinery. I know Pulsar's actors accept a :lifecycle-handle function which can be used to make actors do "special tricks", but there isn't a lot of documentation around this, so it's unclear whether the functionality would be helpful.
I do have a solution to this problem using agents, core.async channels, and watch functions, but it's a bit messy, and I'm hoping there is a better solution. I'll post it as a solution in case others find it helpful, but I'd like to see what others come up with.
Here's the solution I came up with using agents, core.async channels, and watch functions. Again, it's a bit messy, but it does what I need it to for now. Here it is, in broad strokes:
(require '[clojure.core.async :as async :refer [>!! <!! >! <! chan go]])

; We'll call this thing a queued-agent
(defprotocol IQueuedAgent
  (enqueue [this message])
  (ping [this]))

(defrecord QueuedAgent [agent queue]
  IQueuedAgent
  (enqueue [_ message]
    (go (>! queue message)))
  (ping [_]
    (send agent identity)))

; Need a function for draining a core.async channel of all messages
(defn drain! [c]
  (let [cc (chan 1)]
    (go (>! cc ::queue-empty))
    (letfn
      ; This fn does all the hard work, but closes over cc to avoid reconstruction
      [(drainer! [c]
         (let [[v _] (<!! (go (async/alts! [c cc] :priority true)))]
           (if (= v ::queue-empty)
             (lazy-seq [])
             (lazy-seq (cons v (drainer! c))))))]
      (drainer! c))))
; Constructor function
(defn queued-agent [& {:keys [buffer update-fn init-fn error-handler-builder] :or {buffer 100}}]
  (let [q (chan buffer)
        a (agent (if init-fn (init-fn) {}))
        error-handler-fn (error-handler-builder q a)]
    ; Set up the queue, and a watcher which runs the update function when there is new data
    (add-watch
      a
      :update-conv
      (fn [k r o n]
        (let [queued (drain! q)]
          (when-not (empty? queued)
            (send a update-fn queued error-handler-fn)))))
    (QueuedAgent. a q)))
; Now we can use these like this
(def a (queued-agent
         :init-fn (fn [] {:some "initial value"})
         :update-fn (fn [a queued-data error-handler-fn]
                      (println "Receiving data" queued-data)
                      ; Simulate some work/load on data
                      (Thread/sleep 2000)
                      (println "Done with work; ready to queue more up!"))
         ; This is a little warty at the moment, but closing over the queue and agent
         ; lets you requeue work on failure so you can try again.
         :error-handler-builder
         (fn [q a] (println "do something with errors"))))

(defn -main []
  (doseq [i (range 10)]
    (enqueue a (str "data" i))
    (Thread/sleep 500) ; simulate things happening
    ; This part stinks... have to manually let the queued agent know
    ; that we've queued some things up for it
    (ping a)))
As you'll notice, having to ping the queued-agent here every time new data is added is pretty warty. It definitely feels like things are being twisted out of typical usage.
Agents are the inverse of what you want here - they are a value that gets sent updating functions. This is easiest with a queue and a Thread. For convenience I am using future to construct the thread.
user> (def q (java.util.concurrent.LinkedBlockingDeque.))
#'user/q
user> (defn accumulate
        [summary input]
        (let [{vowels true consonents false}
              (group-by #(contains? (set "aeiouAEIOU") %) input)]
          (-> summary
              (update-in [:vowels] + (count vowels))
              (update-in [:consonents] + (count consonents)))))
#'user/accumulate
user> (def worker
        (future (loop [summary {:vowels 0 :consonents 0} in-string (.take q)]
                  (if (not in-string)
                    summary
                    (recur (accumulate summary in-string)
                           (.take q))))))
#'user/worker
user> (.add q "hello")
true
user> (.add q "goodbye")
true
user> (.add q false) ; a falsey value acts as the stop sentinel for the worker loop
true
user> @worker
{:vowels 5, :consonents 7}
I came up with something closer to an actor, inspired by Tim Baldridge's cast on actors (Episode 16). I think this addresses the problem much more cleanly.
(defmacro take-all! [c]
  `(loop [acc# []]
     (let [[v# ~c] (alts! [~c] :default nil)]
       (if (not= ~c :default)
         (recur (conj acc# v#))
         acc#))))

(defn eager-actor [f]
  (let [msgbox (chan 1024)]
    (go (loop [f f]
          (let [first-msg (<! msgbox) ; do this so we park efficiently, and only
                                      ; run when there are actually messages
                msgs (take-all! msgbox)
                msgs (concat [first-msg] msgs)]
            (recur (f msgs)))))
    msgbox))

(let [a (eager-actor (fn f [ms]
                       (Thread/sleep 1000) ; simulate work
                       (println "doing something with" ms)
                       f))]
  (doseq [i (range 20)]
    (Thread/sleep 300)
    (put! a i)))
;; =>
;; doing something with (0)
;; doing something with (1 2 3)
;; doing something with (4 5 6)
;; doing something with (7 8 9 10)
;; doing something with (11 12 13)

How to memoize a function that uses core.async and non-blocking channel read?

I'd like to use memoize for a function that uses core.async and <!, e.g.
(defn foo [x]
  (go
    (<! (timeout 2000))
    (* 2 x)))
(In real life, it could be useful to cache the results of server calls)
I was able to achieve that by writing a core.async version of memoize (almost the same code as memoize):
(defn memoize-async [f]
  (let [mem (atom {})]
    (fn [& args]
      (go
        (if-let [e (find @mem args)]
          (val e)
          (let [ret (<! (apply f args))] ; this line differs from memoize [ret (apply f args)]
            (swap! mem assoc args ret)
            ret))))))
Example of usage:
(def foo-memo (memoize-async foo))
(go (println (<! (foo-memo 3)))); delay because of (<! (timeout 2000))
(go (println (<! (foo-memo 3)))); subsequent calls are memoized => no delay
I am wondering if there are simpler ways to achieve the same result.
Remark: I need a solution that works with <!. For <!!, see this question: How to memoize a function that uses core.async and blocking channel read?
You can use the built in memoize function for this. Start by defining a method that reads from a channel and returns the value:
(defn wait-for [ch]
  (<!! ch))
Note that we'll use <!! and not <! because we want this function to block until there is data on the channel in all cases. <! only exhibits this behavior when used in a form inside of a go block.
You can then construct your memoized function by composing this function with foo, like such:
(def foo-memo (memoize (comp wait-for foo)))
foo returns a channel, so wait-for will block until that channel has a value (i.e. until the operation inside foo finishes).
foo-memo can be used similar to your example above, except you do not need the call to <! because wait-for will block for you:
(go (println (foo-memo 3)))
You can also call this outside of a go block, and it will behave like you expect (i.e. block the calling thread until foo returns).
This was a little trickier than I expected. Your solution isn't correct, because if you call your memoized function again with the same arguments before the first call's go block has finished, you will trigger the computation again and get a cache miss. This is often the case when you process lists with core.async.
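For example, with the question's foo and memoize-async, both of the calls below can miss the cache, because the second lookup runs before the first go block has stored its result:

(def foo-memo (memoize-async foo))
(go (println (<! (foo-memo 3))))
(go (println (<! (foo-memo 3)))) ; may recompute: the first result isn't cached yet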
The one below uses core.async's pub/sub to solve this (tested in CLJS only):
(def lookup-sentinel #?(:clj ::not-found :cljs (js-obj)))
(def pending-sentinel #?(:clj ::pending :cljs (js-obj)))

(defn memoize-async
  [f]
  (let [>in (chan)
        pending (pub >in :args)
        mem (atom {})]
    (letfn
      [(memoized [& args]
         (go
           (let [v (get @mem args lookup-sentinel)]
             (condp identical? v
               lookup-sentinel
               (do
                 (swap! mem assoc args pending-sentinel)
                 (go
                   (let [ret (<! (apply f args))]
                     (swap! mem assoc args ret)
                     (put! >in {:args args :ret ret})))
                 (<! (apply memoized args)))
               pending-sentinel
               (let [<out (chan 1)]
                 (sub pending args <out)
                 (:ret (<! <out)))
               v))))]
      memoized)))
NOTE: it probably leaks memory; subscriptions and <out channels are not closed.
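A possible cleanup for that, shown as a fragment replacing the pending branch of the function above (untested; unsub and close! are standard core.async functions):

pending-sentinel
(let [<out (chan 1)]
  (sub pending args <out)
  (let [ret (:ret (<! <out))]
    (unsub pending args <out) ; detach from the pub once the result arrives
    (close! <out)
    ret))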
I have used this function in one of my projects to cache HTTP calls. The function caches results for a given amount of time and uses a barrier to prevent executing the function multiple times when the cache is "cold" (due to the context switch inside the go block).
(defn memoize-af-until
  [af ms clock]
  (let [barrier (async/chan 1)
        last-return (volatile! nil)
        last-return-ms (volatile! nil)]
    (fn [& args]
      (async/go
        (>! barrier :token)
        (let [now-ms (.now clock)]
          (when (or (not @last-return-ms) (< @last-return-ms (- now-ms ms)))
            (vreset! last-return (<! (apply af args)))
            (vreset! last-return-ms now-ms))
          (<! barrier)
          @last-return)))))
You can test that it works properly by setting the cache time to 0 and observing that the two function calls take approximately 10 seconds. Without the barrier the two calls would finish at the same time:
(def memo (memoize-af-until #(async/timeout 5000) 0 js/Date))
(async/take! (memo) #(println "[:a] Finished"))
(async/take! (memo) #(println "[:b] Finished"))