Clojure: why does this writer consume so much heap space?

I have a 700 MB XML file that I process from a records tree to an EDN file.
After doing all the processing, I finally have a lazy sequence of hashmaps that are not particularly big (at most 10 values).
To finish, I want to write it to a file with
(defn write-catalog [catalog-edn]
  (with-open [wrtr (io/writer "catalog-fr.edn")]
    (doseq [x catalog-edn]
      (.write wrtr (prn-str x)))))
I do not understand the problem, because doseq is not supposed to retain the head of the sequence in memory.
My final output catalog is of type clojure.lang.LazySeq.
I then do
(write-catalog catalog)
Then memory usage climbs steadily and I get a GC overhead limit exceeded error after around 80 MB of the file has been written, with an -Xmx of 3g.
I also tried doseq + spit with no prn-str; the same thing happens.
Is this normal behaviour?
Thanks

The memory probably leaks because the catalog values are being realized and retained (google "head retention"). As write-catalog realizes the items one by one, they are all kept in memory (presumably you're def'ing catalog somewhere). To fix this, try to avoid keeping your catalog in a var; instead pass it to write-catalog directly. For example, if you parse it from somewhere (which I guess is true, considering your previous question), you would want to do:
(write-catalog (transform-catalog (get-catalog "mycatalog.xml")))
so the huge intermediate sequences won't eat all your memory.
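For contrast, a minimal sketch of the two shapes (get-catalog and transform-catalog are hypothetical names standing in for your parsing and transformation steps):
;; head retention: the var `catalog` keeps every realized item reachable
(def catalog (transform-catalog (get-catalog "mycatalog.xml")))
(write-catalog catalog)

;; no retention: nothing outside write-catalog references the head of the seq
(write-catalog (transform-catalog (get-catalog "mycatalog.xml")))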
Hope it helps.

Related

Clojure: memory leaks using eval

My application evaluates quoted expressions received from remote clients. Over time, my system's memory usage increases and eventually it crashes. What I've found out is that:
When I execute the following code from Clojure's nrepl in a docker container:
(dotimes [x 1000000] ; or some arbitrary large number
  (eval '(+ 1 1)))
the container's memory usage keeps rising until it hits the limit, at which point the system will crash.
How do I get around this problem?
There's another thread mentioning this behavior. One of the answers mentions the use of tools.reader, which still uses eval if I need code execution, leading to the same problem.
There's no easy way to get around this, as each call to eval creates a new class, even when the form you're evaluating is exactly the same. By itself, the JVM will not get rid of new classes.
There are two ways to circumvent this:
Stop using eval altogether (e.g. by creating your own DSL or your own version of eval with limited functionality, as in the sketch after this list), or at least use it less frequently, e.g. by batching the forms you need to evaluate
Unload already loaded classes - I haven't done it myself and it probably requires a lot of work, but you can follow answers in this topic: Unloading classes in java?
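A minimal sketch of the first option, assuming the incoming forms are limited to simple arithmetic (tiny-eval is a hypothetical name): instead of compiling each form with eval, the expression is walked and dispatched to ordinary functions, so no new class is generated per call.
(defn tiny-eval [form]
  (if (seq? form)
    (let [[op & args] form
          f ({'+ +, '- -, '* *} op)]   ; whitelist of allowed operations
      (apply f (map tiny-eval args)))
    form))                             ; numbers and other literals evaluate to themselves

(tiny-eval '(+ 1 (* 2 3))) ;=> 7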
I don't know exactly how eval works internally, but based on my observations I don't think your conclusions are correct, and Eugene's remark "By itself, the JVM will not get rid of new classes" also seems to be false.
I ran your sample with -Xmx256m and it went fine.
(time (dotimes [x 1000000] ; or some arbitrary large number
        (eval '(+ 1 1))))
;; "Elapsed time: 529079.032449 msecs"
I checked the question you linked, and they say it's Metaspace that is growing, not the heap.
So I observed the Metaspace, and its usage is growing but also shrinking.
You can find my experiment here together with some graphs from JMC console: https://github.com/jumarko/clojure-experiments/commit/824f3a69019840940eaa88c3427515bcba33c4d2
Note: To run this experiment, I've used JDK 17.0.2 on macOS
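If you want to watch Metaspace directly from the REPL rather than from JMC, a standard java.lang.management query works (a sketch; the pool name "Metaspace" is specific to HotSpot-based JVMs):
(import '[java.lang.management ManagementFactory])

;; current Metaspace usage in bytes
(->> (ManagementFactory/getMemoryPoolMXBeans)
     (filter #(= "Metaspace" (.getName %)))
     (map #(.. % getUsage getUsed))
     first)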

Clojure - process huge files with low memory

I am processing text files of 60 GB or larger. The files are separated into a header section of variable length and a data section. I have three functions:
head? a predicate to distinguish header lines from data lines
process-header process one header line string
process-data process one data line string
The processing functions asynchronously access and modify an in-memory database
I adapted a file-reading method from another SO thread, which should build a lazy sequence of lines. The idea was to process some lines with one function, then switch once and keep processing with the next function.
(defn lazy-file
  [file-name]
  (letfn [(helper [rdr]
            (lazy-seq
              (if-let [line (.readLine rdr)]
                (cons line (helper rdr))
                (do (.close rdr) nil))))]
    (try
      (helper (clojure.java.io/reader file-name))
      (catch Exception e
        (println "Exception while trying to open file" file-name)))))
I use it with something like
(let [lfile (lazy-file "my-file.txt")]
  (doseq [line lfile :while (head? line)]
    (process-header line))
  (doseq [line (drop-while head? lfile)]
    (process-data line)))
Although that works, it's rather inefficient for a couple of reasons:
Instead of simply calling process-header until I reach the data and then continuing with process-data, I have to iterate over the header lines and process them, then restart parsing the whole file and drop all the header lines again to reach the data. This is the exact opposite of what lazy-file was intended to do.
Watching memory consumption shows me that the program, though seemingly lazy, builds up to use as much RAM as would be required to keep the whole file in memory.
So what is a more efficient, idiomatic way to work with my database?
One idea might be using a multimethod to process header and data lines depending on the value of the head? predicate, but I suppose this would have a serious speed impact, especially as there is only one point where the predicate outcome changes from always true to always false. I haven't benchmarked that yet.
Would it be better to use another way to build the line-seq and parse it with iterate? This would still leave me needing :while and drop-while, I guess.
In my research, using NIO file access was mentioned a couple of times, which should improve memory usage. I could not yet find out how to use it in an idiomatic way in Clojure.
Maybe I still have a bad grasp of the general idea, how the file should be treated?
As always, any help, ideas or pointers to tuts are greatly appreciated.
You should use standard library functions.
line-seq, with-open and doseq will easily do the job.
Something along the lines of:
(with-open [rdr (clojure.java.io/reader file-path)]
  (doseq [line (line-seq rdr)]
    (if (head? line)
      (process-header line)
      (process-data line))))
There are several things to consider here:
Memory usage
There are reports that leiningen might add stuff that results in keeping references to the head, although doseq specifically does not hold on to the head of the sequence it's processing, cf. this SO question. Try verifying your claim "use as much RAM as would be required to keep the file in memory" without using lein repl.
Parsing lines
Instead of using two loops with doseq, you could also use a loop/recur approach. Whether you are still parsing headers would be carried as a second loop argument, like this (untested):
(loop [lfile (lazy-file "my-file.txt")
       parse-header true]
  (when-let [line (first lfile)]        ; stop at the end of the sequence
    (if (and parse-header (head? line))
      (do (process-header line)
          (recur (rest lfile) true))
      (do (process-data line)
          (recur (rest lfile) false)))))
There is another option here, which would be to incorporate your processing functions into your file reading function. So, instead of just consing a new line and returning it, you could just as well process it right away -- typically you could hand over the processing function as an argument instead of hard-coding it.
Your current code looks like the processing is done for its side effects. If so, you could probably do away with the laziness entirely once you incorporate the processing. You need to process the entire file anyway (or so it seems), and you do so on a per-line basis. The lazy-seq approach basically just aligns a single line read with a single processing call. Your need for laziness arises in the current solution because you separate reading (the entire file, line by line) from processing. If you instead move the processing of a line into the reading, you don't need to do it lazily.
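A minimal sketch of that idea, with the processing passed in as an argument (process-file! is a hypothetical name; head?, process-header and process-data are the functions from the question):
(defn process-file! [file-name process-line]
  (with-open [rdr (clojure.java.io/reader file-name)]
    (doseq [line (line-seq rdr)]   ; each line is handled as it is read, nothing retains the head
      (process-line line))))

(process-file! "my-file.txt"
               (fn [line]
                 (if (head? line)
                   (process-header line)
                   (process-data line))))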

Clojure: how to simulate adding metadata to characters in stream?

Having a function which returns a seq of characters, I need to modify it to allow attaching metadata to some characters (but not all). Clojure doesn't support 'with-meta' on primitive types. So, the possible options are:
return a seq of vectors of [character, metadata];
pros: simplicity, data and metadata are tied together
cons: need to extract data from vector
return two separate seqs, one for characters and one for metadata; the caller must iterate both simultaneously if they care about metadata;
pros: caller is not forced to extract data from each stream element and may throw away meta-sequence if he wishes
cons: need to iterate both seqs at once, more complexity on caller side if metadata is needed
introduce some record-wrapper containing one character and allowing to attach meta to itself (Clojure records implement IMeta);
pros: data and metadata are tied together
cons: need to extract data from record
or a better option that you can suggest.
Which approach is better?
Using a vector or map sequence, e.g.
({:char \x :meta <...>} {:char \y :meta <...>} {:char \z :meta <...>} ...)
; or
([\x <...>] [\y <...>] [\z <...>] ...)
looks like the best option to me; that's what I'd do myself if I had such a task. Then, for example, writing a function which transforms such a sequence back into a sequence of chars is very simple:
(defn characters [s] (map :char s))
; or
(defn characters [s] (map first s))
Iterating through characters and metadata at the same time is also very easy using destructuring bindings:
(doseq [{:keys [char meta]} s] ...)
; or
(doseq [[char meta] s] ...)
What to use (map or vector) is mostly a matter of personal preference.
IMO, using records and their IMeta interface is not quite as good: I think that this kind of metadata is mostly intended for language-related code (macros, code preprocessing, syntax extensions, etc.) and not for domain code. Of course, I may be wrong in this assumption.
And using two parallel sequences is the worst option, because it is not as convenient for the user of your interface as a single sequence. Throwing away the metadata is very simple with the function written above, and it will not even have performance implications if all the sequences are lazy.
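A minimal, self-contained sketch of the vector-pair approach (annotate and the :upper? key are hypothetical, just to show the shape):
(defn annotate [chars meta-fn]
  (map (fn [c] [c (meta-fn c)]) chars))

(annotate [\x \Y \z] (fn [c] {:upper? (Character/isUpperCase c)}))
;=> ([\x {:upper? false}] [\Y {:upper? true}] [\z {:upper? false}])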

Using `line-seq` with `reader`, when is the file closed?

I'm reading lines from a text file using (line-seq (reader "input.txt")). This collection is then passed around and used by my program.
I'm concerned that this may be bad style, however, as I'm not deterministically closing the file. I imagine that I can't simply wrap it in with-open, e.g. (with-open [r (reader "input.txt")] (line-seq r)), as the file stream will potentially get closed before I've traversed the entire sequence.
Should lazy-seq be avoided in conjunction with reader for files? Is there a different pattern I should be using here?
Since this doesn't really have a clear answer (it's all mixed into comments on the first answer), here's the essence of it:
(with-open [r (reader "input.txt")]
(doall (line-seq r)))
That will force the whole sequence of lines to be read and close the file. You can then pass the result of that whole expression around.
When dealing with large files, you may have memory problems (holding the whole sequence of lines in memory) and that's when it's a good idea to invert the program:
(with-open [r (reader "input.txt")]
(doall (my-program (line-seq r))))
You may or may not need doall in that case, depending on what my-program returns and/or whether my-program consumes the sequence lazily or not.
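For instance, a small sketch of the inverted shape, assuming count-matches is the processing you want to run over the lines: count fully consumes the lazy seq while the reader is still open, so no doall is needed here.
(defn count-matches [lines]
  (count (filter #(re-find #"ERROR" %) lines)))

(with-open [r (clojure.java.io/reader "input.txt")]
  (count-matches (line-seq r)))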
It seems like clojure.contrib.duck-streams/read-lines is just what you are looking for. read-lines closes the file when there is no more input, and otherwise returns a sequence just like line-seq. Take a look at the source code of read-lines.

interaction between seque and pmap

If I pmap a function onto a sequence, how far ahead will the sequence be realised in parallel, given that a single thread is reading from the resulting sequence?
Will this be different if I wrap it in a seque:
(seque 30 (pmap do-stuff (range 30000)))
vs.
(pmap do-stuff (range 30000))
pmap provides no guarantees of how far ahead it will read on its input sequence; presumably, not much farther than what it needs to do its calculation.
(seque 30 ...) will realize and cache up to 30 elements from pmap's output sequence. That must logically be at least the first 30 from the input sequence. How far beyond that I can't say without looking at the implementation of pmap, which you probably ought not to depend upon.
I'm curious why you need to know this. The details of when a function executes, particularly in a pmap, are something you usually want abstracted away. If it's curiosity, great. But if you're depending on some side effect of the do-stuff function, you're Doing It Wrong(tm).
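If you want to measure it rather than reason about it, here is a rough experiment (a sketch; the realization counter is only for observation and should not be relied upon in real code):
(def realized (atom 0))

(def s (seque 30 (pmap (fn [x] (swap! realized inc) x) (range 30000))))

(first s)            ; consume a single element
(Thread/sleep 500)   ; give the worker threads a moment
@realized            ; typically well above 1: pmap and seque have read ahead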