Optimizing merge-with in Clojure

The following works fine for small lists (< 500), but it hangs indefinitely for larger lists (> 2500). Is there a better way to achieve this effect without failing?
(def errors '({:a-key ["some string"]}
              {:a-key ["some string"]}
              {:a-key ["some string"]}
              {:a-key ["some other string"]}))

(def unique-errors (apply merge-with (comp distinct into) errors))
;; => {:a-key ("some string" "some other string")}

The main problem with your code, I think, is not that it's slow but that it leads to a stack overflow: distinct is called once for every new error, and it's lazy, so when the result is finally printed there is a deep tower of "nested" distinct calls to unwind.
But anyway: use sets for things that should not contain duplicates. Using sets leads to the following, which is a little faster and does not overflow the stack.
(def errors (repeat 5000 {:a-key #{"some string"}}))
(apply merge-with into errors)
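If the error values arrive as vectors, as in the question, they can be converted to sets before merging. A minimal sketch, assuming Clojure 1.11+ for update-vals (on older versions, map over the entries instead):

(def unique-errors
  ;; sets deduplicate eagerly, so no lazy distinct wrappers pile up
  (apply merge-with into (map #(update-vals % set) errors)))
;; => {:a-key #{"some string" "some other string"}}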

Related

Functional alternative to "let"

I find myself writing a lot of clojure in this manner:
(defn my-fun [input]
  (let [result1 (some-complicated-procedure input)
        result2 (some-other-procedure result1)]
    (do-something-with-results result1 result2)))
This let statement seems very... imperative, which I don't like. In principle, I could be writing the same function like this:
(defn my-fun [input]
  (do-something-with-results (some-complicated-procedure input)
                             (some-other-procedure (some-complicated-procedure input))))
The problem with this is that it involves recomputation of some-complicated-procedure, which may be arbitrarily expensive. Also you can imagine that some-complicated-procedure is actually a series of nested function calls, and then I either have to write a whole new function, or risk that changes in the first invocation don't get applied to the second:
E.g. this works, but I have to have an extra shallow, top-level function that makes it hard to do a mental stack trace:
(defn some-complicated-procedure [input] (lots (of (nested (operations input)))))

(defn my-fun [input]
  (do-something-with-results (some-complicated-procedure input)
                             (some-other-procedure (some-complicated-procedure input))))
E.g. this is dangerous because refactoring is hard:
(defn my-fun [input]
  (do-something-with-results (lots (of (nested (operations (mistake input))))) ; oops, made a change here that wasn't applied to the other nested call
                             (some-other-procedure (lots (of (nested (operations input)))))))
Given these tradeoffs, I feel like I don't have any alternative to writing long, imperative let statements, but when I do, I can't shake the feeling that I'm not writing idiomatic Clojure. Is there a way I can address the recomputation and code-cleanliness problems raised above and write idiomatic Clojure? Are imperative-ish let statements idiomatic?
The kind of let statements you describe might remind you of imperative code, but there is nothing imperative about them. Haskell has a similar let ... in expression for binding names to values within a body, too.
If your situation really needs a bigger hammer, there are some bigger hammers that you can either use or take for inspiration. The following two libraries offer a binding form (akin to let) with localized memoization of results, so that each step runs only when necessary and its result is reused if needed again: Plumatic Plumbing, specifically the Graph part, and Zach Tellman's Manifold, whose let-flow form furthermore orchestrates asynchronous steps so that they wait for the necessary inputs to become available and run in parallel when possible. Even if you decide to maintain your present course, their docs make good reading, and the code of Manifold itself is educational.
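As a rough illustration of the Graph idea, the question's my-fun could be expressed as a graph of keyword-named steps. This is only a sketch; some-complicated-procedure and friends are the question's hypothetical functions, and 42 stands in for real input:

(require '[plumbing.core :refer [fnk]]
         '[plumbing.graph :as graph])

(def my-fun-graph
  {:result1 (fnk [input] (some-complicated-procedure input))
   :result2 (fnk [result1] (some-other-procedure result1))
   :output  (fnk [result1 result2] (do-something-with-results result1 result2))})

;; compile works out the dependency order; each step runs at most once
(def my-fun* (graph/compile my-fun-graph))

(:output (my-fun* {:input 42}))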
I recently had this same question when I looked at this code I wrote
(let [user-symbols (map :symbol states)
      duplicates (for [[id freq] (frequencies user-symbols)
                       :when (> freq 1)]
                   id)]
  (do-something-with duplicates))
You'll note that map and for are lazy and will not be executed until do-something-with runs. It's also possible that not all (or even none) of the states will be mapped, or the frequencies calculated; it depends on what do-something-with actually requests of the sequence returned by for. This is very much idiomatic functional programming.
I guess the simplest approach to keep it functional would be to have a pass-through state that accumulates the intermediate results, something like this:
(defn with-state [res-key f state]
  (assoc state res-key (f state)))

user> (with-state :res (comp inc :init) {:init 10})
;;=> {:init 10, :res 11}
so you can move on to something like this:
(->> {:init 100}
     (with-state :inc'd (comp inc :init))
     (with-state :inc-doubled (comp (partial * 2) :inc'd))
     (with-state :inc-doubled-squared (comp #(* % %) :inc-doubled))
     (with-state :summarized (fn [st] (apply + (vals st)))))
;;=> {:init 100,
;;    :inc'd 101,
;;    :inc-doubled 202,
;;    :inc-doubled-squared 40804,
;;    :summarized 41207}
The let form is a perfectly functional construct and can be seen as syntactic sugar for calls to anonymous functions. We can easily write a recursive macro to implement our own version of let:
(defmacro my-let [bindings body]
  (if (empty? bindings)
    body
    `((fn [~(first bindings)]
        (my-let ~(rest (rest bindings)) ~body))
      ~(second bindings))))
Here is an example of calling it:
(my-let [a 3
         b (+ a 1)]
  (* a b))
;; => 12
And here is macroexpand-all called on the above expression, which reveals how my-let is implemented using anonymous functions:
(clojure.walk/macroexpand-all
 '(my-let [a 3
           b (+ a 1)]
    (* a b)))
;; => ((fn* ([a] ((fn* ([b] (* a b))) (+ a 1)))) 3)
Note that the expansion doesn't rely on let and that the bound symbols become parameter names in the anonymous functions.
As others write, let is actually perfectly functional, but at times it can feel imperative. It's better to become fully comfortable with it.
You might, however, want to kick the tires of my little library tl;dr, which lets you write code like, for example:
(compute
  (+ a b c)
  where
  a (f b)
  c (+ 100 b))

How to parallelize Clojure keep function?

I'm trying to parallelize the function below. I refactored this from a for statement and implemented pmap to speed up reading the xml data, which went well. The next bottleneck is in my keep statement. How can I improve performance here?
I've tried (keep #(when (pmap #(later-date? (second %) after) zip) [(first %) (second %)]) zip) but nested #() functions are not allowed. I've also tried wrapping the keep as well as the call to raw-url-data in a future but dereferencing either in the calling function produces nil.
(defn- raw-url-data
  "Parse xmlzip data and return a sequence of URLs/modtime vectors."
  [data after]
  (let [article (xz/xml-> data :url)
        loc (pmap #(-> (xz/xml-> % :loc xz/text) first) article)
        mod (pmap #(-> (xz/xml-> % :lastmod xz/text) first parse-modtime) article)
        zip (zipmap loc mod)]
    (keep #(when (later-date? (second %) after)
             [(first %) (second %)])
          zip)))
And here is my later-date? function:
(defn- later-date?
  "Return TRUE if DATETIME is after AFTER or either one is NIL."
  [datetime after]
  (or (nil? datetime)
      (nil? after)
      (time/after? datetime after)))
With this type of problem, the tricky part is getting the time spent splitting the data up for parallel processing, and then putting it back together, to be less than the time needed to process it sequentially.
In the problem above, if I'm interpreting it correctly, you are generating two sequences of data, each in parallel, so these sequences can't communicate with each other during this process to see if they have a later date. Only once all of the data for both sequences is finished do you form it into a map, and then you split that map back into a sequence and start processing it.
The first pair of dates, (first loc) and (first mod), will be sitting for quite a while before they can be compared to see if they should go into the final result. So the best speedup may come from simply removing the call to zipmap.
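A sketch of the same function with the zipmap round-trip removed, pairing the two pmapped sequences directly (untested, reusing the names from the question):

(defn- raw-url-data
  "Parse xmlzip data and return a sequence of URLs/modtime vectors."
  [data after]
  (let [article (xz/xml-> data :url)
        loc (pmap #(first (xz/xml-> % :loc xz/text)) article)
        mod (pmap #(parse-modtime (first (xz/xml-> % :lastmod xz/text))) article)]
    ;; compare each pair as soon as it is realized, no intermediate map
    (keep (fn [[l m]] (when (later-date? m after) [l m]))
          (map vector loc mod))))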
time/after? is very fast, so you will almost certainly lose time by calling pmap here, though it's good to know how to do it anyway. You can get around the inability of the anonymous-function reader macro to handle nested anonymous functions by making one of them a call to fn, like so:
(keep (fn [x]
        (when (#(later-date? % after) (second x))
          [(first x) (second x)]))
      zip)
Another approach is to
break it into partitions,
do all the processing on each partition, and
merge them back together.
Then adjust the partition size until you see a benefit over the splitting costs, along the lines of the sketch below.
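A minimal sketch of that partitioning approach (the partition size 512 is an arbitrary starting point to tune, and parallel-keep is a hypothetical helper name):

(defn parallel-keep
  "Run keep over fixed-size partitions in parallel, then concatenate."
  [f coll]
  (->> (partition-all 512 coll)
       ;; doall forces each partition's work onto the pmap worker thread
       (pmap (fn [part] (doall (keep f part))))
       (apply concat)))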
This has been discussed here and here.

Clojure confusion - behavior of map, doseq in a multiprocess environment

In trying to replicate some websockets examples I've run into some behavior I don't understand and can't seem to find documentation for. Simplified, here's an example I'm running in lein that's supposed to run a function for every element in a shared map once per second:
(def clients (atom {"a" "b" "c" "d"}))
(def ticker-agent (agent nil))

(defn execute [a]
  (println "execute")
  (let [keys (keys @clients)]
    (println "keys= " keys)
    (doseq [x keys] (println x)))
  ;(map (fn [k] (println k)) keys)) ;; replace doseq with this?
  (Thread/sleep 1000)
  (send *agent* execute))

(defn -main [& args]
  (send ticker-agent execute))
If I run this with map I get
execute
keys= (a c)
execute
keys= (a c)
...
First confusing issue: I understand that I'm likely using map incorrectly because there's no return value, but does that mean the inner println is optimized away? Especially given that if I run this in a repl:
(map #(println %) '(1 2 3))
it works fine?
Second question: if I run this with doseq instead of map, I can run into conditions where the execution agent stops (which I'd append here, but am having difficulty isolating/recreating). Clearly there's something I'm missing, possibly relating to locking on the map's keyset? I was able to do this even after moving the shared map out of an atom. Is there default synchronization on the Clojure map?
map is lazy. This means that it does not calculate any result until that result is accessed from the data structure it returns; it will not run anything if its result is never used.
When you use map from the REPL, the print stage of the REPL accesses the data, which causes any side effects in your mapped function to be invoked. Inside a function, if the return value is not inspected, the side effects in the mapping function will not occur.
You can use doall to force full evaluation of a lazy sequence. You can use dorun if you don't need the result value but want to ensure all side effects are invoked. Also you can use mapv which is not lazy (because vectors are never lazy), and gives you an associative data structure, which is often useful (better random access performance, optimized for appending rather than prepending).
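For example, a quick REPL sketch of the three options just mentioned:

;; dorun forces a lazy seq purely for its side effects, returning nil
(dorun (map println (keys @clients)))
;; doall forces the seq and returns the realized result
(doall (map inc (range 5)))   ;=> (1 2 3 4 5)
;; mapv is eager and returns a vector
(mapv inc (range 5))          ;=> [1 2 3 4 5]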
Edit: Regarding the second part of your question (moving this here from a comment).
No, there is nothing about doseq that would hang your execution. Try checking the agent-error status of your agent to see if there is some exception, because agents stop executing and stop accepting new tasks by default if they hit an error condition. You can also use set-error-mode! and set-error-handler! to customize the agent's error handling behavior.
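For instance, a sketch against the ticker-agent from the question:

;; nil if healthy, otherwise the exception that stopped the agent
(agent-error ticker-agent)
;; keep accepting sends after an error instead of stopping
(set-error-mode! ticker-agent :continue)
;; log failures instead of silently swallowing them
(set-error-handler! ticker-agent
                    (fn [a ex] (println "agent failed:" ex)))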

Clojure lazy sequences in math.combinatorics results in OutOfMemory (OOM) Error

The documentation of math.combinatorics states that all functions return lazy sequences.
However if I try to run subsets with a lot of data,
(last (combinatorics/subsets (range 20)))
;OutOfMemoryError Java heap space clojure.lang.RT.cons (RT.java:559)
I get an OutOfMemory Error.
Running
(last (range))
burns CPU, but it doesn't return an error.
Clojure doesn't seem to be "holding onto the head" as explained in this Stack Overflow question.
Why is this happening and how I can use bigger ranges in subsets?
Update
It seems to work on some people's computers, as the comments suggest, so I will post my system configuration.
I run a Mac (10.8.3) and installed Clojure (1.5.1) with Homebrew.
My Java version is:
% java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06-451-11M4406)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01-451, mixed mode)
I didn't change any of the default settings. I also reinstalled all dependencies, by deleting the ~/.m2 folder.
My project.clj.
And the command I used was this:
% lein repl
nREPL server started on port 61774
REPL-y 0.1.10
Clojure 1.5.1
=> (require 'clojure.math.combinatorics)
nil
=> (last (clojure.math.combinatorics/subsets (range 20)))
OutOfMemoryError Java heap space clojure.lang.RT.cons (RT.java:570)
or
OutOfMemoryError Java heap space clojure.math.combinatorics/index-combinations/fn--1148/step--1164 (combinatorics.clj:64)
I tested the problem on a colleague's laptop, and he had the same issue, but he was on a Mac, too.
The issue is that subsets uses mapcat, and mapcat is not lazy enough as it uses apply which realizes and holds some of the elements to be concatenated. See a very nice explanation here. Using the lazier mapcat version of that link in subsets should fix the issue:
(defn my-mapcat [f coll]
  (lazy-seq
   (if (not-empty coll)
     (concat
      (f (first coll))
      (my-mapcat f (rest coll))))))

(defn subsets
  "All the subsets of items"
  [items]
  (my-mapcat (fn [n] (clojure.math.combinatorics/combinations items n))
             (range (inc (count items)))))
(last (subsets (range 50))) ;; this will take hours to compute, good luck with it!
You want to compute the power set of a set with 1000 elements? You know that's going to have 2^1000 elements, right? That is so large I can't even find a good way to describe how enormous it is. If you're trying to work with such a set, and you can do so lazily, your problem won't be memory: it will be computation time. Let's say you have a supercomputer with infinite memory, capable of processing a trillion items per nanosecond: that's 10^21 items processed per second, or about 10^29 items per year. Even this supercomputer will take much, much longer than the lifetime of the universe to work through the items of (subsets (range 1000)).
So I'd say, stop worrying about the memory usage of this collection, and work on an algorithm that doesn't involve walking through sequences with more elements than there are atoms in the universe.
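For a quick sanity check of those numbers, here is a throwaway REPL calculation:

;; 2^1000 subsets, versus items processed per year at 10^12 items/ns
(def subset-count (.pow (biginteger 2) 1000))      ; about 1.07e301
(def items-per-year
  (*' 1000000000000N 1000000000N 3600 24 365))     ; about 3.15e28
(double (/ subset-count items-per-year))           ; about 3.4e272 years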
The problem is neither with apply, nor with concat, nor with mapcat.
dAni's answer, where he reimplements mapcat, does accidentally fix the problem, but the reasoning behind it is not correct. Also, his answer points to an article where the author says "I believe the problem lies in apply". This is clearly wrong, as I am about to explain below. Finally, the issue at hand is not related to this other one, where non-lazy evaluation is indeed caused by apply.
If you look closely, both dAni and the author of that article implement mapcat without the use of the map function. I will show in the next example that the issue is related to the way the map function is implemented.
To demonstrate that the issue is not related to either apply or concat see the following implementation of mapcat. It uses both concat and apply, still it achieves full laziness:
(defn map
  ([f coll]
   (lazy-seq
    (when-let [s (seq coll)]
      (cons (f (first s)) (map f (rest s)))))))

(defn mapcat [f & colls]
  (apply concat (apply map f colls)))

(defn range-announce! [x]
  (do (println "Returning sequence of" x)
      (range x)))
;; new fully lazy implementation prints only 5 lines
(nth (mapcat range-announce! (range)) 5)
;; clojure.core version still prints 32 lines
(nth (clojure.core/mapcat range-announce! (range)) 5)
The full laziness in the above code is achieved by reimplementing the map function. In fact mapcat is implemented exactly the same way as in clojure.core, yet it works fully lazy. The above map implementation is a bit simplified for the sake of the example, as it only supports a single parameter, but even implementing it with the whole variadic signature will work the same: full laziness. So we showed that the problem here is neither with apply nor with concat. Also, we showed that the real problem must be related to how the map function is implemented in clojure.core. Let's take a look at it:
(defn map
  ([f coll]
   (lazy-seq
    (when-let [s (seq coll)]
      (if (chunked-seq? s)
        (let [c (chunk-first s)
              size (int (count c))
              b (chunk-buffer size)]
          (dotimes [i size]
            (chunk-append b (f (.nth c i))))
          (chunk-cons (chunk b) (map f (chunk-rest s))))
        (cons (f (first s)) (map f (rest s))))))))
It can be seen that the clojure.core implementation is exactly the same as our "simplified" version before, except for the true branch of the if (chunked-seq? s) expression. Essentially clojure.core/map has a special case for handling input sequences which are chunked sequences.
Chunked sequences compromise laziness by evaluating in chunks of 32 instead of strictly one at a time. This becomes painfully evident when evaluating deeply nested chunked sequences, like in the case of subsets. Chunked sequences were introduced in Clojure 1.1, and many core functions were upgraded to recognize and process them differently, including map. The main purpose of introducing them was to improve performance in certain stream-processing scenarios, but arguably they make it significantly harder to reason about the laziness characteristics of a program. You can read up on chunked sequences here and here. Also check out this question here.
The real problem is that range returns a chunked seq, and is used internally by subsets. The fix recommended by David James patches subsets to unchunk the sequence created by range internally.
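The unchunk helper usually looks like this; it is not part of clojure.core, but versions of it circulate widely:

(defn unchunk [s]
  (lazy-seq
   (when-let [s (seq s)]
     (cons (first s) (unchunk (rest s))))))

;; chunked: map realizes a whole chunk of 32 elements for the first item
(first (map #(do (println "realizing" %) %) (range 100)))
;; unchunked: only the first element is realized
(first (map #(do (println "realizing" %) %) (unchunk (range 100))))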
This issue has been raised on the project's ticket tracker: Clojure JIRA: OutOfMemoryError with combinatorics/subsets. There, you can find a patch by Andy Fingerhut. It worked for me. Note that the patch is different than the mapcat variation suggested by another answer.
In the absence of command-line arguments, the startup heap size parameters of a JVM are determined by various ergonomics. The defaults (JDK 6) are:
initial heap size: memory / 64
maximum heap size: MIN(memory / 4, 1GB)
but you can force absolute values using the -Xmx and -Xms args.
You can find more detail here
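Since the question's setup uses lein, one way to apply these flags per project is via :jvm-opts in project.clj (the sizes here are arbitrary examples):

;; in project.clj
:jvm-opts ["-Xms512m" "-Xmx2g"]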

Should not a tail-recursive function also be faster?

I have the following Clojure code to calculate a number with a certain "factorable" property. (what exactly the code does is secondary).
(defn factor-9
  ([]
   (let [digits (take 9 (iterate #(inc %) 1))
         nums (map (fn [x] (Integer. (apply str x))) (permutations digits))]
     (some (fn [x] (and (factor-9 x) x)) nums)))
  ([n]
   (or (= 1 (count (str n)))
       (and (divisible-by-length n)
            (factor-9 (quot n 10))))))
Now, I'm into TCO and realize that Clojure can only provide tail-recursion if explicitly told so using the recur keyword. So I've rewritten the code to do that (replacing factor-9 with recur being the only difference):
(defn factor-9
  ([]
   (let [digits (take 9 (iterate #(inc %) 1))
         nums (map (fn [x] (Integer. (apply str x))) (permutations digits))]
     (some (fn [x] (and (factor-9 x) x)) nums)))
  ([n]
   (or (= 1 (count (str n)))
       (and (divisible-by-length n)
            (recur (quot n 10))))))
To my knowledge, TCO has a double benefit. The first is that it does not use the stack as heavily as a non-tail-recursive call and thus does not blow it on deeper recursions. The second, I think, is that it is consequently faster, since the recursion can be compiled to a loop.
Now, I've made a very rough benchmark and have not seen any difference between the two implementations. Am I wrong in my second assumption, or does this have something to do with running on the JVM (which does not have automatic TCO) and recur using a trick to achieve it?
Thank you.
The use of recur does speed things up, but only by about 3 nanoseconds (really) over a recursive call. When things get that small they can be hidden in the noise of the rest of the test. I wrote four tests (link below) that are able to illustrate the difference in performance.
I'd also suggest using something like criterium when benchmarking. (Stack Overflow won't let me post with more than 1 link since I've got no reputation to speak of, so you'll have to google it, maybe "clojure criterium")
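A minimal criterium sketch (quick-bench is the library's entry point; the argument here is just the zero-arity call from the question):

(require '[criterium.core :refer [quick-bench]])
;; runs the expression many times and reports mean, std dev, percentiles
(quick-bench (factor-9))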
For formatting reasons, I've put the tests and results in this gist.
Briefly, comparing relative times: if the recursive test is 1, then the looping test is about 0.45 and the TCO test about 0.87, and the absolute difference between the recursive and TCO tests is around 3 ns.
Of course, all the caveats about benchmarking apply.
When optimizing any code, it's good to start from potential or actual bottlenecks and optimize that first.
It seems to me that this particular piece of code is eating most of your CPU time:
(map (fn [x] (Integer. (apply str x))) (permutations digits))
And that doesn't depend on TCO in any way; it is executed the same way in both versions. So the tail call in this particular example will let you avoid using up the stack, but to achieve better performance, try optimizing this expression instead.
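One hedged idea for that expression (a sketch, not benchmarked): Long/parseLong avoids constructing a boxed Integer from a string, and a transducer skips the intermediate lazy sequence. permutations is assumed to come from clojure.math.combinatorics, as in the question:

(into []
      (map #(Long/parseLong (apply str %)))  ; parse digits without the Integer. constructor
      (permutations digits))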
Just a gentle reminder that Clojure has no TCO.
After evaluating (factor-9 (quot n 10)), an and and an or still have to be evaluated before the function can return, so the call is not in tail position and the function is not tail-recursive.