How to add multiple counts of large lists - clojure

If I do a simple version
(+ (count [1 2 3 4]) (count [1 2 3 4]))
I get the correct answer 8
But when I use the large-scale version from my program, where each count could reach 100,000, it no longer works. combatLog is a 100,000-line log file.
(let [rdr (BufferedReader. (FileReader. combatLog))]
(+ (count (filter (comp not nil?) (take 100000 (repeatedly #(re-seq #"Kebtiz hits" (str (.readLine rdr)))))))
(count (filter (comp not nil?) (take 100000 (repeatedly #(re-seq #"Kebtiz misses" (str (.readLine rdr)))))))
)
)
In this case, it returns only the value of the first count. I am trying to figure out either why + and count aren't working in this case, or another way to sum the total number of elements in both lists.

In your code you are reading from the same reader in two places. The first count consumes the whole reader, so the second one has no lines left to filter. Notice that each call to .readLine advances the position in the input file.
I guess you wanted to do something like:
(with-open [reader (clojure.java.io/reader combatLog)]
(->> reader
(line-seq)
(filter #(re-seq #"Kebtiz hits|Kebtiz misses" %))
(count)))
Using with-open will make sure your file handle will be closed and resources won't leak. I also combined your two separate regular expressions into one.
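If you also want the hits and misses counted separately as well as summed, one pass with reduce avoids reading the file twice. A sketch, with hit-miss-counts as a hypothetical helper name:

```clojure
(require '[clojure.java.io :as io])

;; Sketch: count hits and misses separately in a single pass over the file.
;; combat-log is assumed to be a path to the log file.
(defn hit-miss-counts [combat-log]
  (with-open [rdr (io/reader combat-log)]
    (reduce (fn [counts line]
              (cond
                (re-find #"Kebtiz hits"   line) (update counts :hits inc)
                (re-find #"Kebtiz misses" line) (update counts :misses inc)
                :else counts))
            {:hits 0 :misses 0}
            (line-seq rdr))))
```

Then (+ (:hits m) (:misses m)) gives the combined total.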

Related

streaming multiple output lines from a single input line using Clojure data.csv

I am reading a csv file, processing the input, appending the output to the input, and writing the results to an output csv. Seems pretty straightforward. I am using Clojure data.csv. However, I ran into a nuance in the output that does not fit with anything I've run into before with Clojure and I cannot figure it out. The output will contain 0 to N lines for each input, and I cannot figure out how to stream this down to the calling fn.
Here is the form that is processing the file:
(defn process-file
[from to]
(let [ctr (atom 0)]
(with-open [r (io/reader from)
w (io/writer to)]
(some->> (csv/read-csv r)
(map #(process-line % ctr))
(csv/write-csv w)))))
And here is the form that processes each line (that returns 0 to N lines that each need to be written to the output csv):
(defn process-line
[line ctr]
(swap! ctr inc)
(->> (apps-for-org (first line))
(reduce #(conj %1 (add-results-to-input line %2)) [])))
Honestly, I didn't fully understand your question, but from your comment, I seem to have answered it.
If you want to run csv/write-csv for each row returned, you could just map over the rows:
(some->> (csv/read-csv r)
(map #(process-line % ctr))
(mapv #(csv/write-csv w %)))
Note my use of mapv since you're running side effects. Depending on the context, if you used just map, the laziness may prevent the writes from happening.
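To see that laziness pitfall in isolation, here is a toy sketch (write-lazy and write-eager are made-up names):

```clojure
;; map is lazy: if the resulting seq is never realized,
;; the side effects silently never run.
(defn write-lazy [rows]
  (map println rows)   ; returns an unrealized lazy seq, then discards it
  nil)

;; mapv is eager, so the printing actually happens.
(defn write-eager [rows]
  (mapv println rows)
  nil)
```

Calling (write-lazy [1 2 3]) prints nothing, while (write-eager [1 2 3]) prints each row.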
It would be arguably more correct to use doseq however:
(let [rows (some->> (csv/read-csv r)
(map #(process-line % ctr)))]
(doseq [row rows]
(csv/write-csv w row)))
doseq makes it clear that the intent of the iteration is to carry out side effects and not to produce a new (immutable) list.
I think the problem you're running into is that your map function process-line returns a collection of zero-to-many rows, so when you map over the input lines you get a collection of collections, when you just want one collection (of rows) to send to write-csv. If that's true, the fix is simply changing this line to use mapcat:
(mapcat #(process-line % ctr))
mapcat is like map, but it will "merge" the resulting collections of each process-line call into a single collection.
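A small illustration of the difference, with fake-process-line standing in for your process-line:

```clojure
;; Each call returns a collection of rows, like process-line does.
(defn fake-process-line [n] (repeat n n))

(map fake-process-line [1 2 3])
;; => ((1) (2 2) (3 3 3))   -- a collection of collections

(mapcat fake-process-line [1 2 3])
;; => (1 2 2 3 3 3)         -- one flat collection of rows
```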

clojure refactor code from recursion

I have the following bit of code that produces the correct results:
(ns scratch.core
(:require [clojure.string :as str]))
(defn numberify [str]
(vec (map read-string (str/split str #" "))))
(defn process [acc sticks]
(let [smallest (apply min sticks)
cuts (filter #(> % 0) (map #(- % smallest) sticks))]
(if (empty? cuts)
acc
(process (conj acc (count cuts)) cuts))))
(defn print-result [[x & xs]]
(prn x)
(if (seq xs)
(recur xs)))
(let [input "8\n1 2 3 4 3 3 2 1"
lines (str/split-lines input)
length (read-string (first lines))
inputs (first (rest lines))]
(print-result (process [length] (numberify inputs))))
The process function above recursively calls itself until the sequence sticks is empty?.
I am curious to know if I could have used something like take-while or some other technique to make the code more succinct?
If ever I need to do some work on a sequence until it is empty then I use recursion but I can't help thinking there is a better way.
Your core problem can be described as
1. stop if the count of sticks is zero
2. accumulate the count of sticks
3. subtract the smallest stick from each of the sticks
4. keep only the positive sticks
5. go back to 1.
Identify the smallest sub-problem as steps 3 and 4 and put a box around it
(defn cuts [sticks]
(let [smallest (apply min sticks)]
(filter pos? (map #(- % smallest) sticks))))
Notice that sticks don't change between steps 5 and 3, that cuts is a fn sticks->sticks, so use iterate to put a box around that:
(defn process [sticks]
(->> (iterate cuts sticks)
;; ----- 8< -------------------
This gives an infinite seq of sticks, (cuts sticks), (cuts (cuts sticks)) and so on
Incorporate steps 1 and 2
(defn process [sticks]
(->> (iterate cuts sticks)
(map count) ;; count each sticks
(take-while pos?))) ;; accumulate while counts are positive
(process [1 2 3 4 3 3 2 1])
;-> (8 6 4 1)
Behind the scenes this algorithm hardly differs from the one you posted, since lazy seqs are a delayed implementation of recursion. It is more idiomatic, though: it is more modular, and it uses take-while for termination, which adds to its expressiveness. Also it doesn't require you to pass in the initial count, and it does the right thing if sticks is empty. I hope it is what you were looking for.
I think the way your code is written is a very Lispy way of doing it. Certainly there are many examples in The Little Schemer that follow this format of reduction/recursion.
To replace recursion, I usually look for a solution that involves using higher order functions, in this case reduce. It replaces the min calls each iteration with a single sort at the start.
(defn process [sticks]
(drop-last (reduce (fn [a i]
(let [n (- (last a) (count i))]
(conj a n)))
[(count sticks)]
(partition-by identity (sort sticks)))))
(process [1 2 3 4 3 3 2 1])
=> (8 6 4 1)
I've changed the algorithm to fit reduce by grouping the same numbers after sorting, and then counting each group and reducing the count size.
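For reference, the intermediate steps of this sort-and-group approach look like this:

```clojure
(sort [1 2 3 4 3 3 2 1])
;; => (1 1 2 2 3 3 3 4)

(partition-by identity (sort [1 2 3 4 3 3 2 1]))
;; => ((1 1) (2 2) (3 3 3) (4))

;; The reduce then walks these groups, subtracting each group's size
;; from the previous count: 8, 8-2=6, 6-2=4, 4-3=1, 1-1=0 (dropped by drop-last).
```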

Reducing memory usage in a simple Clojure program

I'm trying to solve the Fibonacci problem on codeeval. At first I wrote it in the usual recursive way and, although I got the right output, I failed the test since it used ~70MB of memory and the usage limit is 20MB. I found an approximation formula and rewrote it to use that thinking it was the heavy stack usage that was causing me to exceed the limit. However, there doesn't seem to be any reduction.
(ns fibonacci.core
(:gen-class))
(use 'clojure.java.io)
(def phi (/ (+ 1 (Math/sqrt 5)) 2))
(defn parse-int
"Find the first integer in the string"
[int-str]
(Integer. (re-find #"\d+" int-str)))
(defn readfile
"Read in a file of integers separated by newlines and return them as a list"
[filename]
(with-open [rdr (reader filename)]
(reverse (map parse-int (into '() (line-seq rdr))))))
(defn fibonacci
"Calculate the Fibonacci number using an approximatation of Binet's Formula. From
http://www.maths.surrey.ac.uk/hosted-sites/R.Knott/Fibonacci/fibFormula.html"
[term]
(Math/round (/ (Math/pow phi term) (Math/sqrt 5))))
(defn -main
[& args]
(let [filename (first args)
terms (readfile filename)]
(loop [terms terms]
((comp println fibonacci) (first terms))
(if (not (empty? (rest terms)))
(recur (rest terms))))))
(-main (first *command-line-args*))
Sample input:
0
1
2
3
4
5
6
7
8
9
10
11
12
13
50
Sample output:
0
1
1
2
3
5
8
13
21
34
55
89
144
233
12586269025
Their input is clearly much larger than this and I don't get to see it. How can I modify this code to use dramatically less memory?
Edit: It's impossible. Codeeval knows about the problem and is working on it. See here.
Codeeval is broken for Clojure. There are no accepted Clojure solutions listed and there is a message from the company two months ago saying they are still working on the problem.
Darn.
Presumably the problem is that the input file is very large, and your readfile function is creating the entire list of lines in memory.
The way to avoid that is to process a single line at a time, and not hold on to the whole sequence, something like:
(defn -main [& args]
(let [filename (first args)]
(with-open [rdr (reader filename)]
(doseq [line (line-seq rdr)]
(-> line parse-int fibonacci println)))))
Are you getting an ArithmeticException (integer overflow)? If so, Clojure supports arbitrary precision, so it might help to try using /' instead of /. The apostrophe versions of +, -, * and / are designed to auto-promote to BigInt instead of throwing on integer overflow.
See this question.
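For example, the primed operators auto-promote where the plain ones throw:

```clojure
(*' Long/MAX_VALUE 2)
;; => 18446744073709551614N  (auto-promoted to clojure.lang.BigInt)

;; whereas the plain operator throws:
;; (* Long/MAX_VALUE 2)  =>  ArithmeticException: integer overflow
```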

I have two versions of a function to count leading hash(#) characters, which is better?

I wrote a piece of code to count the leading hash (#) characters of a line, much like a heading line in Markdown:
### Line one -> return 3
######## Line two -> return 6 (only care about the first 6 characters)
Version 1
(defn
count-leading-hash
[line]
(let [cnt (count (take-while #(= % \#) line))]
(if (> cnt 6) 6 cnt)))
Version 2
(defn
count-leading-hash
[line]
(loop [cnt 0]
(if (and (= (.charAt line cnt) \#) (< cnt 6))
(recur (inc cnt))
cnt)))
I used time to measure both implementations, and found that the first version, based on take-while, is about 2x faster than version 2. Taking "###### Line one" as input, version 1 took 0.09 msecs and version 2 took about 0.19 msecs.
Question 1. Is it recur that slows down the second implementation?
Question 2. Version 1 is closer to the functional programming paradigm, isn't it?
Question 3. Which one do you prefer? Why? (You're welcome to write your own implementation.)
--Update--
After reading the Clojure docs, I came up with a new version of this function, and I think it's much clearer.
(defn
count-leading-hash
[line]
(->> line (take 6) (take-while #(= \# %)) count))
IMO it isn't useful to take time measurements for small pieces of code
Yes, version 1 is more functional
I prefer version 1 because it is easier to spot errors
I prefer version 1 because it is less code, thus less cost to maintain.
I would write the function like this:
(defn count-leading-hash [line]
(count (take-while #{\#} (take 6 line))))
No, it's the reflection used to invoke .charAt. Call (set! *warn-on-reflection* true) before creating the function, and you'll see the warning.
Insofar as it uses HOFs, sure.
The first; though (if (> cnt 6) 6 cnt) is better written as (min 6 cnt).
1: No. recur is pretty fast. For every function you call, there is a bit of overhead and "noise" from the VM: the REPL needs to parse and evaluate your call for example, or some garbage collection might happen. That's why benchmarks on such tiny bits of code don't mean anything.
Compare with:
(defn
count-leading-hash
[line]
(let [cnt (count (take-while #(= % \#) line))]
(if (> cnt 6) 6 cnt)))
(defn
count-leading-hash2
[line]
(loop [cnt 0]
(if (and (= (.charAt line cnt) \#) (< cnt 6))
(recur (inc cnt))
cnt)))
(def lines ["### Line one" "######## Line two"])
(time (dorun (repeatedly 10000 #(dorun (map count-leading-hash lines)))))
;; "Elapsed time: 620.628 msecs"
;; => nil
(time (dorun (repeatedly 10000 #(dorun (map count-leading-hash2 lines)))))
;; "Elapsed time: 592.721 msecs"
;; => nil
No significant difference.
2: Using loop/recur is not idiomatic in this instance; it's best to use it only when you really need it and use other available functions when you can. There are many useful functions that operate on collections/sequences; check ClojureDocs for a reference and examples. In my experience, people with imperative programming skills who are new to functional programming use loop/recur a lot more than those who have a lot of Clojure experience; loop/recur can be a code smell.
3: I like the first version better. There are lots of different approaches:
;; more expensive, because it iterates n times, where n is the number of #'s
(defn count-leading-hash [line]
(min 6 (count (take-while #(= \# %) line))))
;; takes only at most 6 characters from line, so less expensive
(defn count-leading-hash [line]
(count (take-while #(= \# %) (take 6 line))))
;; instead of an anonymous function, you can use `partial`
(defn count-leading-hash [line]
(count (take-while (partial = \#) (take 6 line))))
edit:
How to decide when to use partial vs an anonymous function?
In terms of performance it doesn't matter, because (partial = \#) evaluates to something like (fn [& args] (apply = \# args)), while #(= \# %) translates to (fn [arg] (= \# arg)). Both are very similar, but partial gives you a function that accepts an arbitrary number of arguments, so in situations where you need that, it's the way to go. Otherwise, use whichever is easier to read.
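The arity difference in action:

```clojure
((partial = \#) \# \# \#) ;; => true  -- works with any number of arguments
(#(= \# %) \#)            ;; => true  -- fixed at exactly one argument
;; (#(= \# %) \# \#)      ;; => ArityException
```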
Micro-benchmarks on the JVM are almost always misleading, unless you really know what you're doing. So, I wouldn't put too much weight on the relative performance of your two solutions.
The first solution is more idiomatic. You only really see explicit loop/recur in Clojure code when it's the only reasonable alternative. In this case, clearly, there is a reasonable alternative.
Another option, if you're comfortable with regular expressions:
(defn count-leading-hash [line]
(count (or (re-find #"^#{1,6}" line) "")))
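A few edge cases of the regex version (the defn is repeated here so the examples stand alone):

```clojure
(defn count-leading-hash [line]
  (count (or (re-find #"^#{1,6}" line) "")))

(count-leading-hash "### Heading")      ;; => 3
(count-leading-hash "######## Heading") ;; => 6  (the {1,6} bound caps it)
(count-leading-hash "no hashes")        ;; => 0  (or catches the nil from re-find)
```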

Clojure OutOfMemoryError

I'm reading about how lazy sequences can cause OutOfMemoryErrors when using, say, loop/recur on large sequences. I'm trying to load a 3MB file into memory to process it, and I think this is happening to me. But I don't know if there's an idiomatic way to fix it. I tried putting in doalls, but then my program didn't seem to terminate. Small inputs work:
Small input (contents of file): AAABBBCCC
Correct output: ((65 65) (65 66) (66 66) (67 67) (67 67))
Code:
(def file-path "/Users/me/Desktop/temp/bob.txt")
;(def file-path "/Users/me/Downloads/3MB_song.m4a")
(def group-by-twos
(fn [a-list]
(let [first-two (fn [a-list] (list (take 2 a-list)))
the-rest-after-two (fn [a-list] (rest (rest a-list)))
only-two-left? (fn [a-list] (if (= (count a-list) 2) true false))]
(loop [result '() rest-of-list a-list]
(if (nil? rest-of-list)
result
(if (only-two-left? rest-of-list)
(concat result (list rest-of-list))
(recur (concat result (first-two rest-of-list))
(the-rest-after-two rest-of-list))))))))
(def get-the-file
(fn [file-name-and-path]
(let [the-file-pointer
(new java.io.RandomAccessFile (new java.io.File file-name-and-path) "r")
intermediate-array (byte-array (.length the-file-pointer))] ;reserve space for final length
(.readFully the-file-pointer intermediate-array)
(group-by-twos (seq intermediate-array)))))
(get-the-file file-path)
As I said above, when I put in doalls in a bunch of places, it didn't seem to finish. How can I get this to run for large files, and is there a way to get rid of the cognitive burden of doing whatever I need to do? Some rule?
You are reading the file completely into memory and then creating a seq over that byte array, which gives you none of the benefits of a lazy sequence: all the data is already loaded, whereas a lazy sequence only produces/generates data when it is required.
What you can do is create the seq over the file content using something like:
(def get-the-file
(fn [file-name-and-path]
(let [the-file-pointer
(new java.io.RandomAccessFile (new java.io.File file-name-and-path) "r")
file-len (.length the-file-pointer)] ;get file len
(partition 2 (map (fn [_] (.readByte the-file-pointer)) (range file-len))))))
NOTE: I haven't really tried it, but I hope it at least gives you the idea for the lazy file-reading part.
I guess an idiomatic solution would be:
(partition 2 (map int (slurp "/Users/me/Desktop/temp/bob.txt")))
This is not fully lazy as the full file is loaded into memory, but it should work without problems for files that are not too big. However partition and map are lazy so if you replace slurp by a buffered reader you will get a fully lazy version.
Note: this will swallow the last char if the size of the file is odd. It is not clear what you expect if the size is odd. If you want to have the last value in its own list, you can use (partition 2 2 [] ... )
user=> (partition 2 (map int "ABCDE"))
((65 66) (67 68))
user=> (partition 2 2 [] (map int "ABCDE"))
((65 66) (67 68) (69))
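For the fully lazy variant hinted at above, here is a sketch that streams characters from a reader instead of slurping (process-pairs-from-file is a hypothetical name; the pairs must be consumed before the reader closes):

```clojure
(require '[clojure.java.io :as io])

;; Sketch: stream the file character by character instead of slurping it.
;; The pairs must be consumed inside with-open, before the reader closes.
(defn process-pairs-from-file [path f]
  (with-open [rdr (io/reader path)]
    (->> (repeatedly #(.read rdr))     ; lazy stream of character codes
         (take-while #(not= -1 %))     ; .read returns -1 at end of stream
         (partition 2 2 [])            ; an odd trailing char gets its own list
         (map f)
         (doall))))                    ; realize while the reader is still open
```

For example, (process-pairs-from-file "bob.txt" identity) on the file AAABB yields ((65 65) (65 66) (66)).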
Beware of Clojure data structures when dealing with large amounts of data: a typical Clojure app uses two to three times as much memory as the same Java application, because sequences are memory-expensive.
If you can read the whole data into an array, do that. Then process it while making sure you don't keep a reference to any sequence head, so that garbage collection can happen during the process.
Also, strings are much bigger than char primitives: a single-character string is about 26 bytes, while a char is 2 bytes.
Even if you don't like using arrays, an ArrayList is several times smaller than a sequence or a vector.
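A sketch of the array-based approach: keep the bytes in a primitive array and build the pairs with an index loop, so no seq over the whole file is retained (pairs-from-array is a hypothetical name):

```clojure
;; Sketch: pair up a primitive byte array without creating a seq
;; over the whole array. Assumes the file fits in memory as bytes.
(defn pairs-from-array [^bytes arr]
  (let [n (alength arr)]
    (loop [i 0, result (transient [])]
      (if (>= i n)
        (persistent! result)
        (recur (+ i 2)
               (conj! result
                      (if (< (inc i) n)
                        [(aget arr i) (aget arr (inc i))]
                        [(aget arr i)])))))))  ; odd trailing byte on its own
```

For example, (pairs-from-array (.getBytes "AAABB")) yields [[65 65] [65 66] [66]].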