Clojure - process huge files with low memory - clojure

I am processing text files 60GB or larger. The files are separated into a header section of variable length and a data section. I have three functions:
head? a predicate to distinguish header lines from data lines
process-header process one header line string
process-data process one data line string
The processing functions asynchronously access and modify an in-memory database
I adapted a file-reading method from another SO thread that should build a lazy sequence of lines. The idea was to process some lines with one function, then switch functions once and keep processing with the next function.
(defn lazy-file
  [file-name]
  (letfn [(helper [rdr]
            (lazy-seq
              (if-let [line (.readLine rdr)]
                (cons line (helper rdr))
                (do (.close rdr) nil))))]
    (try
      (helper (clojure.java.io/reader file-name))
      (catch Exception e
        (println "Exception while trying to open file" file-name)))))
I use it with something like
(let [lfile (lazy-file "my-file.txt")]
  (doseq [line lfile :while (head? line)]
    (process-header line))
  (doseq [line (drop-while head? lfile)]
    (process-data line)))
Although that works, it's rather inefficient for a couple of reasons:
Instead of simply calling process-header until I reach the data and then continuing with process-data, I have to filter and process the header lines, then restart parsing the whole file and drop all header lines to process the data. This is the exact opposite of what lazy-file was intended to do.
Watching memory consumption shows me that the program, though seemingly lazy, builds up to use as much RAM as would be required to keep the file in memory.
So what is a more efficient, idiomatic way to work with my database?
One idea might be using a multimethod to process header and data dependent on the value of the head? predicate, but I suppose this would have some serious speed impact, especially as there is only one occurrence where the predicate outcome changes from always true to always false. I haven't benchmarked that yet.
Would it be better to use another way to build the line-seq and parse it with iterate? This would still leave me needing to use :while and :drop-while, I guess.
In my research, using NIO file access was mentioned a couple of times, which should improve memory usage. I could not yet find out how to use that idiomatically in Clojure.
Maybe I still have a bad grasp of the general idea of how the file should be treated?
As always, any help, ideas or pointers to tuts are greatly appreciated.

You should use standard library functions.
line-seq, with-open and doseq will easily do the job.
Something along the lines of:
(with-open [rdr (clojure.java.io/reader file-path)]
  (doseq [line (line-seq rdr)]
    (if (head? line)
      (process-header line)
      (process-data line))))

There are several things to consider here:
Memory usage
There are reports that Leiningen might add stuff that results in keeping references to the head, although doseq specifically does not hold on to the head of the sequence it's processing; cf. this SO question. Try verifying your claim that the program uses "as much RAM as would be required to keep the file in memory" without using lein repl.
Parsing lines
Instead of using two loops with doseq, you could also use a loop/recur approach, carrying whether you are still parsing the header as a second argument, like this (untested):
(loop [lfile (lazy-file "my-file.txt")
       parse-header true]
  (when-let [line (first lfile)]
    (if (and parse-header (head? line))
      (do (process-header line)
          (recur (rest lfile) true))
      (do (process-data line)
          (recur (rest lfile) false)))))
There is another option here, which would be to incorporate your processing functions into your file reading function. So, instead of just consing a new line and returning it, you could just as well process it right away -- typically you could hand over the processing function as an argument instead of hard-coding it.
Your current code looks like processing is a side-effect. If so, you could then probably do away with the laziness if you incorporate the processing. You need to process the entire file anyway (or so it seems) and you do so on a per-line basis. The lazy-seq approach basically just aligns a single line read with a single processing call. Your need for laziness arises in the current solution because you separate reading (the entire file, line by line) from processing. If you instead move the processing of a line into the reading, you don't need to do that lazily.
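A minimal sketch of that idea, assuming head?, process-header and process-data behave as described in the question (process-file! is a made-up name for illustration):
(defn process-file!
  "Eagerly reads file-name line by line, applying process-header until the
  first data line is seen, then process-data for the rest. Side-effecting."
  [file-name]
  (with-open [rdr (clojure.java.io/reader file-name)]
    (loop [lines (line-seq rdr)
           in-header? true]
      (when-let [line (first lines)]
        (let [header? (and in-header? (head? line))]
          (if header?
            (process-header line)
            (process-data line))
          (recur (rest lines) header?))))))
Since everything happens inside with-open, the reader is closed deterministically, and no reference to the head of the line sequence is kept around.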

Related

Clojure : why does this writer consume so much heap space?

I have a 700 MB XML file that I process from a records tree to an EDN file.
After doing all the processing, I finally have a lazy sequence of hashmaps that are not particularly big (at most 10 values).
To finish, I want to write it to a file with
(defn write-catalog [catalog-edn]
  (with-open [wrtr (io/writer "catalog-fr.edn")]
    (doseq [x catalog-edn]
      (.write wrtr (prn-str x)))))
I do not understand the problem, because doseq is not supposed to retain the head of the sequence in memory.
My final output catalog is of type clojure.lang.LazySeq.
I then do
(write-catalog catalog)
Then memory usage climbs until I get a GC overhead error at around 80 MB of file written, with an -Xmx of 3 GB.
I also tried doseq + spit without prn-str; the same thing happens.
Is this a normal behaviour ?
Thanks
Possibly the memory leaks due to the realization of the catalog values (google "head retention"). When write-catalog realizes the items one by one, they are kept in memory (you're obviously def'ing catalog somewhere). To fix this, try to avoid keeping your catalog in a var; instead pass it to write-catalog directly. If you parse it from somewhere (which I guess is true, considering your previous question), you would want to do:
(write-catalog (transform-catalog (get-catalog "mycatalog.xml")))
so huge intermediate sequences won't eat all your memory
Hope it helps.
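To illustrate the difference (a minimal sketch, reusing the hypothetical get-catalog and transform-catalog from above):
;; head retention: the var `catalog` keeps every realized item reachable
;; for as long as write-catalog is still walking the sequence
(def catalog (transform-catalog (get-catalog "mycatalog.xml")))
(write-catalog catalog)

;; no head retention: nothing else references the sequence, so each item
;; can be garbage collected as soon as write-catalog is done with it
(write-catalog (transform-catalog (get-catalog "mycatalog.xml")))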

couldn't use for loop in go block of core.async?

I'm new to the Clojure core.async library, and I'm trying to understand it through experimentation.
But when I tried:
(let [i (async/chan)] (async/go (doall (for [r [1 2 3]] (async/>! i r)))))
it gives me a very strange exception:
CompilerException java.lang.IllegalArgumentException: No method in multimethod '-item-to-ssa' for dispatch value: :fn
and I tried another code:
(let [i (async/chan)] (async/go (doseq [r [1 2 3]] (async/>! i r))))
it gives no compiler exception at all.
I'm totally confused. What happened?
So the Clojure go-block stops translation at function boundaries, for many reasons, but the biggest is simplicity. This is most commonly seen when constructing a lazy seq:
(go (lazy-seq (<! c)))
Gets compiled into something like this:
(go (clojure.lang.LazySeq. (fn [] (<! c))))
Now let's think about this for a moment... what should this return? What you probably wanted was a lazy seq containing the value taken from c, but <! needs to translate the rest of the function body into a callback, while LazySeq expects its function to be synchronous. There really isn't a way around this limitation.
So back to your question: if you macroexpand for, you'll see that it doesn't actually loop; instead it expands into a bunch of code that eventually calls lazy-seq, so parking ops don't work inside the body. doseq (and dotimes), however, are backed by loop/recur, so those will work perfectly fine.
There are a few other places where this might trip you up, with-bindings being one example. Basically, if a macro sticks your core.async parking operations into a nested function, you'll get this error.
My suggestion then is to keep the body of your go blocks as simple as possible. Write pure functions, and then treat the body of go blocks as the places to do IO.
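A minimal sketch of that style, assuming core.async is aliased as async (as in the question) and using a made-up pure function transform:
(require '[clojure.core.async :as async])

;; pure function: no parking ops inside, so it can be called freely
(defn transform [r]
  (* r 10))

(let [c (async/chan 3)]
  (async/go
    ;; doseq is backed by loop/recur, so >! parks just fine here
    (doseq [r [1 2 3]]
      (async/>! c (transform r)))))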
------------ EDIT -------------
By "stops translation at function boundaries", I mean this: the go block takes its body and translates it into a state machine. Each call to <!, >! or alts! (and a few others) is considered a state-machine transition where the execution of the block can pause. At each of those points the machine is turned into a callback and attached to the channel. When this macro reaches a fn form it stops translating. So you can only make calls to <! from inside a go block, not inside a function inside a go block.
This is part of the magic of core.async. Without the go macro, core.async code would look a lot like callback-hell in other languages.

Clojure lazy-seq over Java iterative code

I'm trying to create a Clojure seq from some iterative Java library code that I inherited. Basically, what the Java code does is read records from a file using a parser, send those records to a processor and return an ArrayList of results. In Java this is done by calling parser.readData(), then parser.getRecord() to get a record, then passing that record into processor.processRecord(). Each call to parser.readData() returns a single record or null if there are no more records. Pretty common pattern in Java.
So I created this next-record function in Clojure that will get the next record from a parser.
(defn next-record
  "Get the next record from the parser and process it."
  [parser processor]
  (let [datamap (.readData parser)
        row (.getRecord parser datamap)]
    (if (nil? row)
      nil
      (.processRecord processor row 100))))
The idea then is to call this function and accumulate the records into a Clojure seq (preferably a lazy seq). So here is my first attempt which works great as long as there aren't too many records:
(defn datamap-seq
  "Returns a lazy seq of the records using the given parser and processor"
  [parser processor]
  (lazy-seq
    (when-let [records (next-record parser processor)]
      (cons records (datamap-seq parser processor)))))
I can create a parser and processor, and do something like (take 5 (datamap-seq parser processor)) which gives me a lazy seq. And as expected getting the (first) of that seq only realizes one element, doing count realizes all of them, etc. Just the behavior I would expect from a lazy seq.
Of course when there are a lot of records I end up with a StackOverflowException. So my next attempt was to use loop-recur to do the same thing.
(defn datamap-seq
  "Returns a lazy seq of the records using the given parser and processor"
  [parser processor]
  (lazy-seq
    (loop [records (seq '())]
      (if-let [record (next-record parser processor)]
        (recur (cons record records))
        records))))
Now using this the same way and defing it using (def results (datamap-seq parser processor)) gives me a lazy seq and doesn't realize any elements. However, as soon as I do anything else like (first results) it forces the realization of the entire seq.
Can anyone help me understand where I'm going wrong in the second function using loop-recur that causes it to realize the entire thing?
UPDATE:
I've looked a little closer at the stack trace from the exception and the stack overflow exception is being thrown from one of the Java classes. BUT it only happens when I have the datamap-seq function like this (the one I posted above actually does work):
(defn datamap-seq
  "Returns a lazy seq of the records using the given parser and processor"
  [parser processor]
  (lazy-seq
    (when-let [records (next-record parser processor)]
      (cons records (remove empty? (datamap-seq parser processor))))))
I don't really understand why that remove causes problems, but when I take it out of this function it all works right (I'm doing the removal of empty lists somewhere else now).
loop/recur loops within the loop expression until the recursion runs out. Adding a lazy-seq around it won't prevent that.
Your first attempt with lazy-seq / cons should already work as you want, without stack overflows. I can't spot right now what the problem with it is, though it might be in the java part of the code.
I'll post this here as an addition to Joost's answer. This code:
(defn integers [start]
  (lazy-seq
    (cons
      start
      (integers (inc start)))))
will not throw a StackOverflowException if I do something like this:
(take 5 (drop 1000000 (integers 0)))
EDIT:
Of course a better way to do it would be (iterate inc 0). :)
EDIT2:
I'll try to explain a little how lazy-seq works. lazy-seq is a macro that returns a seq-like object. Combined with cons, which doesn't realize its second argument until it is requested, you get laziness.
Now take a look at how the LazySeq class is implemented. LazySeq.sval triggers computation of the next value, which returns another instance of a "frozen" lazy sequence. The method LazySeq.seq shows the mechanics behind the concept even better. Notice that to fully realize a sequence it uses a while loop. That in itself means that stack usage is limited to short function calls that return further instances of LazySeq.
I hope this makes any sense. I described what I could deduce from the source code. Please let me know if I made any mistakes.
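For example, using the integers function from above (a small sketch):
(def xs (integers 0))   ; nothing realized yet
(first xs)              ; => 0, realizes exactly one step
(nth xs 100000)         ; => 100000, realized by an iterative walk over
                        ;    successive LazySeq instances, not by 100000
                        ;    nested stack frames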

Using `line-seq` with `reader`, when is the file closed?

I'm reading lines from a text file using (line-seq (reader "input.txt")). This collection is then passed around and used by my program.
I'm concerned that this may be bad style, however, as I'm not deterministically closing the file. I imagine that I can't use (with-open [r (reader "input.txt")] (line-seq r)), as the file stream will potentially get closed before I've traversed the entire sequence.
Should lazy-seq be avoided in conjunction with reader for files? Is there a different pattern I should be using here?
Since this doesn't really have a clear answer (it's all mixed into comments on the first answer), here's the essence of it:
(with-open [r (reader "input.txt")]
  (doall (line-seq r)))
That will force the whole sequence of lines to be read and close the file. You can then pass the result of that whole expression around.
When dealing with large files, you may have memory problems (holding the whole sequence of lines in memory) and that's when it's a good idea to invert the program:
(with-open [r (reader "input.txt")]
  (doall (my-program (line-seq r))))
You may or may not need doall in that case, depending on what my-program returns and/or whether my-program consumes the sequence lazily or not.
It seems like clojure.contrib.duck-streams/read-lines is just what you are looking for. read-lines closes the file when there is no more input and otherwise returns the sequence just like line-seq. Take a look at the source code of read-lines.
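The idea behind it is roughly the following (a minimal sketch, not the actual contrib source):
(defn read-lines
  "Like line-seq, but closes the reader once the last line has been read."
  [file]
  (let [rdr (clojure.java.io/reader file)]
    (letfn [(step []
              (lazy-seq
                (if-let [line (.readLine rdr)]
                  (cons line (step))
                  (do (.close rdr) nil))))]
      (step))))
Note that with this approach the file only gets closed if the sequence is actually consumed to the end.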

Best Practice for globals in clojure, (refs vs alter-var-root)?

I've found myself using the following idiom lately in clojure code.
(def *some-global-var* (ref {}))
(defn get-global-var []
  @*some-global-var*)
(defn update-global-var [val]
  (dosync (ref-set *some-global-var* val)))
Most of the time this isn't even multi-threaded code that might need the transactional semantics that refs give you. It just feels like refs are for more than threaded code but basically for any global that requires immutability. Is there a better practice for this? I could try to refactor the code to just use binding or let but that can get particularly tricky for some applications.
I always use an atom rather than a ref when I see this kind of pattern - if you don't need transactions, just a shared mutable storage location, then atoms seem to be the way to go.
e.g. for a mutable map of key/value pairs I would use:
(def state (atom {}))
(defn get-state [key]
  (@state key))
(defn update-state [key val]
  (swap! state assoc key val))
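Usage would then look like this (a quick sketch):
(update-state :name "Rich")
(get-state :name)   ; => "Rich"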
Your functions have side effects. Calling them twice with the same inputs may give different return values depending on the current value of *some-global-var*. This makes things difficult to test and reason about, especially once you have more than one of these global vars floating around.
People calling your functions may not even know that your functions depend on the value of the global var without inspecting the source. What if they forget to initialize the global var? It's easy to forget. What if you have two sets of code both trying to use a library that relies on these global vars? They are probably going to step all over each other, unless you use binding. You also add overhead every time you access data from a ref.
If you write your code side-effect free, these problems go away. A function stands on its own. It's easy to test: pass it some inputs, inspect the outputs, they'll always be the same. It's easy to see what inputs a function depends on: they're all in the argument list. And now your code is thread-safe. And probably runs faster.
It's tricky to think about code this way if you're used to the "mutate a bunch of objects/memory" style of programming, but once you get the hang of it, it becomes relatively straightforward to organize your programs this way. Your code generally ends up as simple as or simpler than the global-mutation version of the same code.
Here's a highly contrived example:
(def *address-book* (ref {}))
(defn add [name addr]
  (dosync (alter *address-book* assoc name addr)))
(defn report []
  (doseq [[name addr] @*address-book*]
    (println name ":" addr)))
(defn do-some-stuff []
  (add "Brian" "123 Bovine University Blvd.")
  (add "Roger" "456 Main St.")
  (report))
Looking at do-some-stuff in isolation, what the heck is it doing? There are a lot of things happening implicitly. Down this path lies spaghetti. An arguably better version:
(defn make-address-book [] {})
(defn add [addr-book name addr]
  (assoc addr-book name addr))
(defn report [addr-book]
  (doseq [[name addr] addr-book]
    (println name ":" addr)))
(defn do-some-stuff []
  (let [addr-book (make-address-book)]
    (-> addr-book
        (add "Brian" "123 Bovine University Blvd.")
        (add "Roger" "456 Main St.")
        (report))))
Now it's clear what do-some-stuff is doing, even in isolation. You can have as many address books floating around as you want. Multiple threads could have their own. You can use this code from multiple namespaces safely. You can't forget to initialize the address book, because you pass it as an argument. You can test report easily: just pass the desired "mock" address book in and see what it prints. You don't have to care about any global state or anything but the function you're testing at the moment.
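For example, such a mock test might look like this (hypothetical data):
(report {"Brian" "123 Bovine University Blvd."})
;; prints: Brian : 123 Bovine University Blvd.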
If you don't need to coordinate updates to a data structure from multiple threads, there's usually no need to use refs or global vars.