Clojure OutOfMemoryError

I'm reading about how lazy sequences can cause OutOfMemoryErrors when using, say, loop/recur on large sequences. I'm trying to load a 3MB file into memory to process it, and I think this is happening to me. But I don't know if there's an idiomatic way to fix it. I tried putting in doalls, but then my program didn't seem to terminate. Small inputs work:
Small input (contents of file): AAABBBCCC
Correct output: ((65 65) (65 66) (66 66) (67 67) (67 67))
Code:
(def file-path "/Users/me/Desktop/temp/bob.txt")
;(def file-path "/Users/me/Downloads/3MB_song.m4a")
(def group-by-twos
  (fn [a-list]
    (let [first-two (fn [a-list] (list (take 2 a-list)))
          the-rest-after-two (fn [a-list] (rest (rest a-list)))
          only-two-left? (fn [a-list] (if (= (count a-list) 2) true false))]
      (loop [result '() rest-of-list a-list]
        (if (nil? rest-of-list)
          result
          (if (only-two-left? rest-of-list)
            (concat result (list rest-of-list))
            (recur (concat result (first-two rest-of-list))
                   (the-rest-after-two rest-of-list))))))))
(def get-the-file
  (fn [file-name-and-path]
    (let [the-file-pointer
          (new java.io.RandomAccessFile (new java.io.File file-name-and-path) "r")
          intermediate-array (byte-array (.length the-file-pointer))] ; reserve space for final length
      (.readFully the-file-pointer intermediate-array)
      (group-by-twos (seq intermediate-array)))))
(get-the-file file-path)
As I said above, when I put in doalls in a bunch of places, it didn't seem to finish. How can I get this to run for large files, and is there a way to get rid of the cognitive burden of doing whatever I need to do? Some rule?

You are reading the file completely into memory and then creating a seq over this byte array, which doesn't really give you any benefit from lazy sequences: all the data is already loaded in memory, and a lazy sequence really means producing/generating data only when it is required.
What you can do is create the seq over the file content using something like:
(def get-the-file
  (fn [file-name-and-path]
    (let [the-file-pointer
          (new java.io.RandomAccessFile (new java.io.File file-name-and-path) "r")
          file-len (.length the-file-pointer)] ; get file length
      (partition 2 (map (fn [_] (.readByte the-file-pointer)) (range file-len))))))
NOTE: I haven't really tried it, but I hope it at least gives you the idea about the lazy file-reading part.

I guess an idiomatic solution would be:
(partition 2 (map int (slurp "/Users/me/Desktop/temp/bob.txt")))
This is not fully lazy, as the full file is loaded into memory, but it should work without problems for files that are not too big. However, partition and map are lazy, so if you replace slurp with a buffered reader you will get a fully lazy version.
Note: this will swallow the last char if the size of the file is odd. It is not clear what you expect if the size is odd. If you want to have the last value in its own list, you can use (partition 2 2 [] ... )
user=> (partition 2 (map int "ABCDE"))
((65 66) (67 68))
user=> (partition 2 2 [] (map int "ABCDE"))
((65 66) (67 68) (69))
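The "fully lazy version" mentioned above could be sketched like this (an untested sketch: `.read` on a reader returns the next character as an int, or -1 at end of stream, and the result must be realized before `with-open` closes the reader):

```clojure
;; Sketch of a fully lazy reader-based variant (untested assumption).
(with-open [r (clojure.java.io/reader "/Users/me/Desktop/temp/bob.txt")]
  (->> (repeatedly #(.read r))    ; next char as an int; -1 at end of stream
       (take-while #(not= -1 %))
       (partition 2 2 [])         ; keep a trailing odd element, if any
       (doall)))                  ; realize before the reader closes
```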

Beware of Clojure data structures when dealing with large amounts of data (a typical Clojure app uses two to three times as much memory as the same Java application - sequences are memory-expensive).
If you can read the whole data into an array, do that. Then process it while making sure you don't keep a reference to any sequence head, to ensure garbage collection can happen during the process.
Also, strings are much bigger than char primitives: a single-char string is about 26 bytes, while a char is 2 bytes.
Even if you don't like using arrays, an ArrayList is several times smaller than a sequence or a vector.
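As a sketch of that advice, reducing over the raw byte array directly never allocates a sequence at all (the path is the hypothetical one from the question; `areduce` is Clojure's array-specific reduction macro):

```clojure
;; Keep the data as a primitive byte array; no seq head is ever retained.
(let [^bytes data (java.nio.file.Files/readAllBytes
                    (.toPath (java.io.File. "/Users/me/Desktop/temp/bob.txt")))]
  (areduce data i acc 0 (+ acc (aget data i)))) ; e.g. sum all the bytes
```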

Related

Making Clojure let statement more functional

In my first Clojure project everything turned out nicely except this part at the end:
(let [broken-signs (->> (:symbols field)
                        (map make-sign)
                        (filter broken?))
      broken-count (count broken-signs)
      unfixable-count (-> (filter (complement fixable?) broken-signs)
                          (count))]
  (println
    (if (> unfixable-count 0)
      -1
      (- broken-count unfixable-count))))
The indentation looks off and it does not feel functional, since I am reusing state in the let block. I basically count the number of broken signs and then the number of fixable signs. If any sign is unfixable, I print -1, otherwise I print the number of signs to be fixed.
If I mapped/filtered twice, I'd have duplicate code, but most of all it'd run slower. Is there a way of improving this code nevertheless?
EDIT: This is what I settled on
(defn count-broken-yet-fixable []
  (let [broken (->> (:symbols field)
                    (map make-sign)
                    (filter broken?))
        unfixable (remove fixable? broken)]
    (when (empty? unfixable)
      (count broken))))

(defn solve-task []
  (if-let [result (count-broken-yet-fixable)]
    result
    -1))

(println (solve-task))
The subtraction was indeed not necessary and the count did not have to happen in the let block. Outputting the -1 on bad input also isn't the job of the function and is only part of the task.
I don't think there's anything "non-functional" or wrong with your approach. The indentation looks fine.
(let [broken (->> (:symbols field)
                  (map make-sign)
                  (filter broken?))
      unfixable (remove fixable? broken)]
  (when (empty? unfixable)
    (count broken)))
You could replace filter (complement with remove
Could use pos? instead of (> n 0)
I might put two printlns inside the if, but really it's better to return a value
You could inline the broken-count binding, since it's only used in one place
I personally think this is easier to read with fewer threading macros
Since the need to count unfixables is conditional, you could test for values with seq first
If you return -1 as a sentinel value, I'd use nil instead; this happens naturally when the when condition isn't met
The conditional logic seems backwards: you return -1 when unfixable-count is positive, and only use its value when it's not positive (which means it's zero b/c count can't be negative), e.g. it could be rewritten as (- broken-count 0) and then just broken-count
(let [broken-signs (->> (:symbols field)
                        (map make-sign)
                        (filter broken?))]
  (if-let [unfixable-signs (seq (remove fixable? broken-signs))]
    -1
    (count broken-signs)))
Your code is already pretty neat and well laid out. The only change I would make is to stay in the domain as long as possible - in this case, the sign objects - and only use the counts much later.
The caller can then use print on the return value to perform the actual side effect.
Your code doesn't have side-effects (except for printing, which we'll fix), so I wouldn't say it's not functional. let is designed for building up intermediate values. In comments are some potential improvements.
; align here (controversial)
(let [broken-signs    (->> (:symbols field)
                           (map make-sign)
                           (filter broken?))
      broken-count    (count broken-signs)
      unfixable-count (->> broken-signs ; maybe emphasize non-function start
                           (filter (complement fixable?))
                           count)] ; no parens needed
  ;; don't print; just return number
  (if (< 0 unfixable-count) ; prefer less-than (Elements of Clojure book)
    -1
    (- broken-count unfixable-count)))
I highly recommend the Clojure Style Guide for related suggestions.

Why does this code report a stack overflow in Clojure

This is a simple attempt to reproduce some code that Ross Ihaka gave as an example of poor R performance. I was curious as to whether Clojure's persistent data structures would offer any improvement.
(https://www.stat.auckland.ac.nz/~ihaka/downloads/JSM-2010.pdf)
However, I'm not even getting to first base: a StackOverflowError is reported, and there's not much else to go by. Any ideas? Apologies in advance if the question has an obvious answer I've missed...
; Testing Ross Ihaka's example of poor R performance
; against Clojure, to see if persistent data structures help

(def dd (repeat 60000 '(0 0 0 0)))

(defn repl-row [d i new-r]
  (concat (take (dec i) d) (list new-r) (drop i d)))

(defn changerows [d new-r]
  (loop [i 10000
         data d]
    (if (zero? i)
      data
      (let [j (rand-int 60000)
            newdata (repl-row data j new-r)]
        (recur (dec i) newdata)))))
user=> (changerows dd '(1 2 3 4))
StackOverflowError clojure.lang.Numbers.isPos (Numbers.java:96)
Further, if anyone has any ideas how persistent functional data structures can be used to best advantage in the example above, I'd be very keen to hear. The speedup reported not using immutable structures (link above) was about 500%!
Looking at the stack trace for the StackOverflowError, this seems to be an "exploding thunk" (lazy/suspended calculation) problem that isn't obviously related to the recursion in your example:
java.lang.StackOverflowError
at clojure.lang.RT.seq(RT.java:528)
at clojure.core$seq__5124.invokeStatic(core.clj:137)
at clojure.core$concat$cat__5217$fn__5218.invoke(core.clj:726)
at clojure.lang.LazySeq.sval(LazySeq.java:40)
at clojure.lang.LazySeq.seq(LazySeq.java:49)
at clojure.lang.RT.seq(RT.java:528)
at clojure.core$seq__5124.invokeStatic(core.clj:137)
at clojure.core$take$fn__5630.invoke(core.clj:2876)
Changing this line to realize newdata into a vector resolves the issue:
(recur (dec i) (vec newdata))
This workaround addresses the use of concat in repl-row by forcing concat's lazy sequence to be realized at each step. concat returns lazy sequences, and in your loop/recur you're passing the lazy/unevaluated results of previous concat calls as input to subsequent concat calls, which return more lazy sequences based on previous, unrealized lazy sequences. The final concat-produced lazy sequence isn't realized until the loop finishes, which results in a stack overflow due to its dependence on thousands of previous concat-produced lazy sequences.
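A minimal toy repro of the same thunk explosion, detached from the original example:

```clojure
;; Each step wraps the previous, still-unrealized lazy seq in another concat.
(def exploded
  (reduce (fn [acc _] (concat acc [0])) () (range 100000)))

;; Forcing it has to unwind 100000 nested thunks at once:
;; (count exploded) ; likely throws StackOverflowError
```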
Further, if anyone has any ideas how persistent functional data structures can be used to best advantage in the example above, I'd be very keen to hear.
Since it seems the usage of concat here is to simply replace an element in a collection, we can get the same effect by using a vector and assoc-ing the new item into the correct position of the vector:
(def dd (vec (repeat 60000 '(0 0 0 0))))

(defn changerows [d new-r]
  (loop [i 10000
         data d]
    (if (zero? i)
      data
      (let [j (rand-int 60000)
            newdata (assoc data j new-r)]
        (recur (dec i) newdata)))))
Notice there's no more repl-row function, we just assoc into data using the index and the new value. After some rudimentary benchmarking with time, this approach seems to be many times faster:
"Elapsed time: 75836.412166 msecs" ;; original sample w/fixed overflow
"Elapsed time: 2.984481 msecs" ;; using vector+assoc instead of concat
And here's another way to solve it by viewing the series of replacements as an infinite sequence of replacement steps, then sampling from that sequence:
(defn random-replace [replacement coll]
  (assoc coll (rand-int (count coll)) replacement))

(->> (iterate (partial random-replace '(1 2 3 4)) dd)
     (drop 10000) ;; could also use `nth` function
     (first))

streaming multiple output lines from a single input line using Clojure data.csv

I am reading a csv file, processing the input, appending the output to the input and writing the results an output csv. Seems pretty straight-forward. I am using Clojure data.csv. However, I ran into a nuance in the output that does not fit with anything I've run into before with Clojure and I cannot figure it out. The output will contain 0 to N lines for each input, and I cannot figure out how to stream this down to the calling fn.
Here is the form that is processing the file:
(defn process-file
  [from to]
  (let [ctr (atom 0)]
    (with-open [r (io/reader from)
                w (io/writer to)]
      (some->> (csv/read-csv r)
               (map #(process-line % ctr))
               (csv/write-csv w)))))
And here is the form that processes each line (that returns 0 to N lines that each need to be written to the output csv):
(defn process-line
  [line ctr]
  (swap! ctr inc)
  (->> (apps-for-org (first line))
       (reduce #(conj %1 (add-results-to-input line %2)) [])))
Honestly, I didn't fully understand your question, but from your comment, I seem to have answered it.
If you want to run csv/write-csv for each row returned, you could just map over the rows:
(some->> (csv/read-csv r)
         (map #(process-line % ctr))
         (mapv #(csv/write-csv w %))))
Note my use of mapv since you're running side effects. Depending on the context, if you used just map, the laziness may prevent the writes from happening.
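A tiny illustration of that pitfall, with hypothetical printlns standing in for the csv writes:

```clojure
;; Building the lazy seq performs no side effects yet...
(def pending (map #(println "writing row" %) [1 2 3]))

;; ...they only happen once something forces the seq:
(dorun pending)
```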
It would be arguably more correct to use doseq however:
(let [rows (some->> (csv/read-csv r)
                    (map #(process-line % ctr)))]
  (doseq [row rows]
    (csv/write-csv w row)))
doseq makes it clear that the intent of the iteration is to carry out side effects and not to produce a new (immutable) list.
I think the problem you're running into is that your map function process-line returns a collection of zero-to-many rows, so when you map over the input lines you get a collection of collections, when you just want one collection (of rows) to send to write-csv. If that's true, the fix is simply changing this line to use mapcat:
(mapcat #(process-line % ctr))
mapcat is like map, but it will "merge" the resulting collections of each process-line call into a single collection.
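A toy stand-in for a process-line that returns zero-to-many rows shows the difference:

```clojure
(defn fake-process-line [n] (repeat n n)) ; hypothetical: returns 0..n "rows"

(map    fake-process-line [0 1 2]) ; => (() (1) (2 2)) - nested collections
(mapcat fake-process-line [0 1 2]) ; => (1 2 2)        - one flat collection
```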

How to add multiple counts of large lists

If I do a simple version
(+ (count [1 2 3 4]) (count [1 2 3 4]))
I get the correct answer 8
But when I use the large scale version from my program that could potentially have count equaling 100,000 it no longer functions. combatLog is a 100,000 line log file.
(let [rdr (BufferedReader. (FileReader. combatLog))]
  (+ (count (filter (comp not nil?)
                    (take 100000 (repeatedly #(re-seq #"Kebtiz hits" (str (.readLine rdr)))))))
     (count (filter (comp not nil?)
                    (take 100000 (repeatedly #(re-seq #"Kebtiz misses" (str (.readLine rdr)))))))))
In this case, it returns only the value of the first count. I am trying to figure out either why + and count aren't working in this case, or another way to sum the total number of elements in both lists.
In your code you are reading from the same reader in two places. It seems that the first line consumes the whole reader and the second one gets no lines to filter. Notice that each call to .readLine moves the position in the input file.
I guess you wanted to do something like:
(with-open [reader (clojure.java.io/reader combatLog)]
  (->> reader
       (line-seq)
       (filter #(re-seq #"Kebtiz hits|Kebtiz misses" %))
       (count)))
Using with-open will make sure your file handle will be closed and resources won't leak. I also combined your two separate regular expressions into one.
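If you ever need the two counts separately while still reading the file only once, one untested sketch is a single reduce over line-seq:

```clojure
(with-open [reader (clojure.java.io/reader combatLog)]
  (reduce (fn [counts line]
            (cond-> counts
              (re-find #"Kebtiz hits" line)   (update :hits inc)
              (re-find #"Kebtiz misses" line) (update :misses inc)))
          {:hits 0 :misses 0}
          (line-seq reader)))
```

Summing the two values of the returned map then gives the same total as the combined regex above.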

I have two versions of a function to count leading hash(#) characters, which is better?

I wrote a piece of code to count the leading hash(#) character of a line, which is much like a heading line in Markdown
### Line one -> return 3
######## Line two -> return 6 (only care about the first 6 characters)
Version 1
(defn count-leading-hash
  [line]
  (let [cnt (count (take-while #(= % \#) line))]
    (if (> cnt 6) 6 cnt)))
Version 2
(defn count-leading-hash
  [line]
  (loop [cnt 0]
    (if (and (= (.charAt line cnt) \#) (< cnt 6))
      (recur (inc cnt))
      cnt)))
I used time to measure both implementations and found that the first version, based on take-while, is about 2x faster than version 2. Taking "###### Line one" as input, version 1 took 0.09 msecs and version 2 took about 0.19 msecs.
Question 1. Is it recur that slows down the second implementation?
Question 2. Version 1 is closer to functional programming paradigm , is it?
Question 3. Which one do you prefer? Why? (You're welcome to write your own implementation.)
--Update--
After reading the Clojure docs, I came up with a new version of this function, and I think it's much clearer.
(defn count-leading-hash
  [line]
  (->> line (take 6) (take-while #(= \# %)) count))
IMO it isn't useful to take time measurements for small pieces of code
Yes, version 1 is more functional
I prefer version 1 because it is easier to spot errors
I prefer version 1 because it is less code, thus less cost to maintain.
I would write the function like this:
(defn count-leading-hash [line]
  (count (take-while #{\#} (take 6 line))))
No, it's the reflection used to invoke .charAt. Call (set! *warn-on-reflection* true) before creating the function, and you'll see the warning.
Insofar as it uses HOFs, sure.
The first, though (if (> cnt 6) 6 cnt) is better written as (min 6 cnt).
1: No. recur is pretty fast. For every function you call, there is a bit of overhead and "noise" from the VM: the REPL needs to parse and evaluate your call for example, or some garbage collection might happen. That's why benchmarks on such tiny bits of code don't mean anything.
Compare with:
(defn count-leading-hash
  [line]
  (let [cnt (count (take-while #(= % \#) line))]
    (if (> cnt 6) 6 cnt)))

(defn count-leading-hash2
  [line]
  (loop [cnt 0]
    (if (and (= (.charAt line cnt) \#) (< cnt 6))
      (recur (inc cnt))
      cnt)))

(def lines ["### Line one" "######## Line two"])

(time (dorun (repeatedly 10000 #(dorun (map count-leading-hash lines)))))
;; "Elapsed time: 620.628 msecs"
;; => nil
(time (dorun (repeatedly 10000 #(dorun (map count-leading-hash2 lines)))))
;; "Elapsed time: 592.721 msecs"
;; => nil
No significant difference.
2: Using loop/recur is not idiomatic in this instance; it's best to use it only when you really need it and use other available functions when you can. There are many useful functions that operate on collections/sequences; check ClojureDocs for a reference and examples. In my experience, people with imperative programming skills who are new to functional programming use loop/recur a lot more than those who have a lot of Clojure experience; loop/recur can be a code smell.
3: I like the first version better. There are lots of different approaches:
;; more expensive, because it iterates n times, where n is the number of #'s
(defn count-leading-hash [line]
  (min 6 (count (take-while #(= \# %) line))))

;; takes only at most 6 characters from line, so less expensive
(defn count-leading-hash [line]
  (count (take-while #(= \# %) (take 6 line))))

;; instead of an anonymous function, you can use `partial`
(defn count-leading-hash [line]
  (count (take-while (partial = \#) (take 6 line))))
edit:
How to decide when to use partial vs an anonymous function?
In terms of performance it doesn't matter, because (partial = \#) evaluates to (fn [& args] (apply = \# args)). #(= \# %) translates to (fn [arg] (= \# arg)). Both are very similar, but partial gives you a function that accepts an arbitrary number of arguments, so in situations where you need it, that's the way to go. partial is the λ (lambda) in lambda calculus. I'd say, use what's easier to read, or partial if you need a function with an arbitrary number of arguments.
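The arity difference is easy to see at the REPL:

```clojure
((partial = \#) \#)       ; => true
((partial = \#) \# \# \#) ; => true, partial accepts any number of args
(#(= \# %) \#)            ; => true
;; (#(= \# %) \# \#)      ; throws ArityException: the literal takes one arg
```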
Micro-benchmarks on the JVM are almost always misleading, unless you really know what you're doing. So, I wouldn't put too much weight on the relative performance of your two solutions.
The first solution is more idiomatic. You only really see explicit loop/recur in Clojure code when it's the only reasonable alternative. In this case, clearly, there is a reasonable alternative.
Another option, if you're comfortable with regular expressions:
(defn count-leading-hash [line]
  (count (or (re-find #"^#{1,6}" line) "")))
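Checked against the examples from the question, this version behaves as expected:

```clojure
(count-leading-hash "### Line one")      ; => 3
(count-leading-hash "######## Line two") ; => 6 (regex matches at most 6 #'s)
(count-leading-hash "no hashes here")    ; => 0 (re-find returns nil)
```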