In Clojure, are lazy seqs always chunked?

I was under the impression that lazy seqs were always chunked.
=> (take 1 (map #(do (print \.) %) (range)))
(................................0)
As expected 32 dots are printed because the lazy seq returned by range is chunked into 32 element chunks. However, when instead of range I try this with my own function get-rss-feeds, the lazy seq is no longer chunked:
=> (take 1 (map #(do (print \.) %) (get-rss-feeds r)))
(."http://wholehealthsource.blogspot.com/feeds/posts/default")
Only one dot is printed, so I guess the lazy-seq returned by get-rss-feeds is not chunked. Indeed:
=> (chunked-seq? (seq (range)))
true
=> (chunked-seq? (seq (get-rss-feeds r)))
false
Here is the source for get-rss-feeds:
(defn get-rss-feeds
  "returns a lazy seq of urls of all feeds; takes an html-resource from the enlive library"
  [hr]
  (map #(:href (:attrs %))
       (filter #(rss-feed? (:type (:attrs %)))
               (html/select hr [:link]))))
So it appears that chunkiness depends on how the lazy seq is produced. I peeked at the source for the function range and there are hints of it being implemented in a "chunky" manner. So I'm a bit confused as to how this works. Can someone please clarify?
Here's why I need to know.
I have the following code: (get-rss-entry (get-rss-feeds h-res) url)
The call to get-rss-feeds returns a lazy sequence of URLs of feeds that I need to examine.
The call to get-rss-entry looks for a particular entry (whose :link field matches the second argument of get-rss-entry). It examines the lazy sequence returned by get-rss-feeds. Evaluating each item requires an http request across the network to fetch a new rss feed. To minimize the number of http requests it's important to examine the sequence one-by-one and stop as soon as there is a match.
Here is the code:
(defn get-rss-entry
  [feeds url]
  (ffirst (drop-while empty? (map #(entry-with-url % url) feeds))))
entry-with-url returns a lazy sequence of matches or an empty sequence if there is no match.
I tested this and it seems to work correctly (evaluating one feed url at a time). But I am worried that somewhere, somehow it will start behaving in a "chunky" way and it will start evaluating 32 feeds at a time. I know there is a way to avoid chunky behavior as discussed here, but it doesn't seem to even be required in this case.
Am I using lazy seq non-idiomatically? Would loop/recur be a better option?

You are right to be concerned. Your get-rss-entry will indeed call entry-with-url more than strictly necessary if the feeds parameter is a collection that returns chunked seqs. For example if feeds is a vector, map will operate on whole chunks at a time.
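A quick REPL check (not from the original answer) makes this concrete; a vector's seq is chunked, so asking for a single element realizes a whole 32-element chunk:
(first (map #(do (print \.) %) (vec (range 100))))
;; ................................0  (32 dots: one full chunk realized)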
This problem is addressed directly in Fogus' Joy of Clojure, with the function seq1 defined in chapter 12:
(defn seq1 [s]
  (lazy-seq
    (when-let [[x] (seq s)]
      (cons x (seq1 (rest s))))))
You could use this right where you know you want the most laziness possible, right before you call entry-with-url:
(defn get-rss-entry
  [feeds url]
  (ffirst (drop-while empty? (map #(entry-with-url % url) (seq1 feeds)))))
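A quick check, mirroring the experiment at the top of the question, confirms that seq1 defeats the chunking; only one dot prints now:
(take 1 (map #(do (print \.) %) (seq1 (range))))
;; (.0)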

Lazy seqs are not always chunked - it depends on how they are produced.
For example, the lazy seq produced by this function is not chunked:
(defn integers-from [n]
  (lazy-seq (cons n (do (print \.) (integers-from (inc n))))))
(take 3 (integers-from 3))
=> (..3 .4 5)
But many other Clojure built-in functions do produce chunked seqs for performance reasons (e.g. range).
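As a quick check (an illustration using the integers-from definition above, not from the original answer), chunked-seq? distinguishes the two, and map, filter and friends preserve the chunking of their input:
(chunked-seq? (seq (map inc (range 100))))  ;=> true  (map preserves chunking)
(chunked-seq? (seq (integers-from 3)))      ;=> false (prints one dot as a side effect)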

Depending on unspecified chunking behavior seems unwise, as you mention above. Explicitly "un-chunking" in cases where you really need one-at-a-time evaluation is also wise, because then if your code later changes in a way that chunks the sequence, things won't break. On another note, if you need actions to be sequential, agents are a great tool: you could send the download functions to an agent, and then they will be run one at a time, and only once, regardless of how you evaluate the sequence. At some point you may want to pmap your sequence, and then even un-chunking will not work, though using an atom will continue to work correctly.
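A minimal sketch of the agent idea, where download-feed! stands in for a hypothetical side-effecting fetch function (it is not part of the question's code):
(def downloader (agent nil))

(defn enqueue-downloads! [urls]
  (doseq [url urls]
    ;; send-off is meant for potentially blocking actions such as I/O;
    ;; the agent runs the queued functions one at a time, in order
    (send-off downloader (fn [_] (download-feed! url)))))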

I have discussed this recently in Can I un-chunk lazy sequences to realize one element at a time? and the conclusion is that if you need to control when items are produced/consumed, you should not use lazy sequences.
For processing you can use transducers, where you control when the next item is processed.
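For example (a small sketch, not from the original answer): in a transducer pipeline the reduction is driven one step at a time, and (take 1) short-circuits it, so only a single element is realized even from a chunked source:
(into [] (comp (map #(do (print \.) %)) (take 1)) (range))
;; prints a single dot and returns [0]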
For producing the elements, the ideal approach is to reify ISeq. A practical approach is to use lazy-seq with a single cons call in it whose rest is a recursive call. But notice that this relies on an implementation detail of lazy-seq.

Related

how to spec a lazy-seq generating function?

I wish to use spec in my pre and post conditions of a generator function. A simplified example of what I wish to do is described below:
(defn positive-numbers
  ([]
   {:post [(s/valid? (s/+ int?) %)]}
   (positive-numbers 1))
  ([n]
   {:post [(s/valid? (s/+ int?) %)]}
   (lazy-seq (cons n (positive-numbers (inc n))))))
(->> (positive-numbers) (take 5))
However, defining the generator function like that seems to cause a stack overflow, the cause being that spec eagerly tries to evaluate the whole thing, or something like that...
Is there another way of using spec to describe the :post result of a generator function like the one above (without causing stack-overflow)?
The theoretically correct answer is that in general you cannot check whether a lazy sequence matches a spec without realizing all of it.
In the case of your specific example of (s/+ int?), given a lazy sequence, how would one establish merely by observing the sequence whether all its elements are integers? However many elements you examine, the next one could always be a keyword.
This is the sort of thing that a type system like, say, core.typed may be able to prove, but a runtime-predicate-based assertion won't be able to check.
Now, in addition to s/+ and s/*, spec (as of Clojure 1.9.0-alpha14) also has a combinator called s/every, whose docstring says this:
Note that 'every' does not do exhaustive checking, rather it samples *coll-check-limit* elements.
So we have e.g.
(s/valid? (s/* int?) (concat (range 1000) [:foo]))
;= false
but
(s/valid? (s/every int?) (concat (range 1000) [:foo]))
;= true
(with the default *coll-check-limit* value of 101).
This actually isn't an immediate fix to your example – plugging in s/every in place of s/+ won't work, because each recursive call will want to validate its own return value, which will involve realizing more of the sequence, which will involve more recursive calls etc. But you could factor out the sequence-building logic to a helper function with no postconditions and then have positive-numbers declare the postcondition and call that helper function:
(defn positive-numbers* [n]
  (lazy-seq (cons n (positive-numbers* (inc n)))))

(defn positive-numbers [n]
  {:post [(s/valid? (s/every int? :min-count 1) %)]}
  (positive-numbers* n))
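Usage sketch: with the default *coll-check-limit* the postcondition realizes at most the first 101 elements for the check, and the rest of the sequence stays lazy:
(take 5 (positive-numbers 1))
;=> (1 2 3 4 5)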
Note the caveats:
1. this will still realize a good chunk of your sequence, which may wreak havoc with your application's performance profile;
2. the only watertight guarantee here is that the prefix actually examined is as desired; if the seq has a weird item at position 123456, that will go unnoticed.
Because of (1), this is something that makes more sense as a test-only assertion. (2) may be acceptable – you'll still catch some silly typos and the documentation value of the spec is there anyway; if it isn't and you do want an absolutely watertight guarantee that your return type is as desired, then again, core.typed (perhaps used locally just for a handful of namespaces) may be the better bet.

Clojure lazy sequences in math.combinatorics results in OutOfMemory (OOM) Error

The documentation of math.combinatorics states that all functions return lazy sequences.
However if I try to run subsets with a lot of data,
(last (combinatorics/subsets (range 20)))
;OutOfMemoryError Java heap space clojure.lang.RT.cons (RT.java:559)
I get an OutOfMemory Error.
Running
(last (range))
burns CPU, but it doesn't return an error.
Clojure doesn't seem to "hold onto the head" as explained in this Stack Overflow question.
Why is this happening and how I can use bigger ranges in subsets?
Update
It seems to work on some people's computers, as the comments suggest, so I will post my system configuration.
I run a Mac (10.8.3) and installed Clojure (1.5.1) with Homebrew.
My Java version is:
% java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06-451-11M4406)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01-451, mixed mode)
I didn't change any of the default settings. I also reinstalled all dependencies, by deleting the ~/.m2 folder.
My project.clj.
And the command I used was this
% lein repl
nREPL server started on port 61774
REPL-y 0.1.10
Clojure 1.5.1
=> (require 'clojure.math.combinatorics)
nil
=> (last (clojure.math.combinatorics/subsets (range 20)))
OutOfMemoryError Java heap space clojure.lang.RT.cons (RT.java:570)
or
OutOfMemoryError Java heap space clojure.math.combinatorics/index-combinations/fn--1148/step--1164 (combinatorics.clj:64)
I tested the problem on a colleague's laptop, and he had the same issue, but he was on a Mac, too.
The issue is that subsets uses mapcat, and mapcat is not lazy enough: it uses apply, which realizes and holds some of the elements to be concatenated. See a very nice explanation here. Using the lazier mapcat version from that link in subsets should fix the issue:
(defn my-mapcat [f coll]
  (lazy-seq
    (if (not-empty coll)
      (concat
        (f (first coll))
        (my-mapcat f (rest coll))))))

(defn subsets
  "All the subsets of items"
  [items]
  (my-mapcat (fn [n] (clojure.math.combinatorics/combinations items n))
             (range (inc (count items)))))
(last (subsets (range 50))) ;; this will take hours to compute, good luck with it!
You want to compute the power set of a set with 1000 elements? You know that's going to have 2^1000 elements, right? That is so large I can't even find a good way to describe how enormous it is. If you're trying to work with such a set, and you can do so lazily, your problem won't be memory: it will be computation time. Let's say you have a supercomputer with infinite memory, capable of processing a trillion items per nanosecond: that's 10^21 items processed per second, or about 10^29 items per year. Even this supercomputer will take much, much longer than the lifetime of the universe to work through the items of (subsets (range 1000)).
So I'd say, stop worrying about the memory usage of this collection, and work on an algorithm that doesn't involve walking through sequences with more elements than there are atoms in the universe.
The problem is neither with apply, nor with concat, nor with mapcat.
dAni's answer, where he reimplements mapcat, does accidentally fix the problem, but the reasoning behind it is not correct. Also, his answer points to an article where the author says "I believe the problem lies in apply". This is clearly wrong, as I am about to explain below. Finally, the issue at hand is not related to this other one, where non-lazy evaluation is indeed caused by apply.
If you look closely, both dAni and the author of that article implement mapcat without the use of the map function. I will show in the next example that the issue is related to the way the map function is implemented.
To demonstrate that the issue is not related to either apply or concat see the following implementation of mapcat. It uses both concat and apply, still it achieves full laziness:
(defn map
  ([f coll]
   (lazy-seq
     (when-let [s (seq coll)]
       (cons (f (first s)) (map f (rest s)))))))

(defn mapcat [f & colls]
  (apply concat (apply map f colls)))

(defn range-announce! [x]
  (do (println "Returning sequence of" x)
      (range x)))

;; new fully lazy implementation prints only 5 lines
(nth (mapcat range-announce! (range)) 5)

;; clojure.core version still prints 32 lines
(nth (clojure.core/mapcat range-announce! (range)) 5)
The full laziness in the above code is achieved by reimplementing the map function. In fact mapcat is implemented exactly the same way as in clojure.core, yet it works fully lazy. The above map implementation is a bit simplified for the sake of the example, as it only supports a single parameter, but even implementing it with the whole variadic signature will work the same: full laziness. So we showed that the problem here is neither with apply nor with concat. Also, we showed that the real problem must be related to how the map function is implemented in clojure.core. Let's take a look at it:
(defn map
  ([f coll]
   (lazy-seq
     (when-let [s (seq coll)]
       (if (chunked-seq? s)
         (let [c (chunk-first s)
               size (int (count c))
               b (chunk-buffer size)]
           (dotimes [i size]
             (chunk-append b (f (.nth c i))))
           (chunk-cons (chunk b) (map f (chunk-rest s))))
         (cons (f (first s)) (map f (rest s))))))))
It can be seen that the clojure.core implementation is exactly the same as our "simplified" version before, except for the true branch of the if (chunked-seq? s) expression. Essentially clojure.core/map has a special case for handling input sequences which are chunked sequences.
Chunked sequences compromise laziness by evaluating in chunks of 32 instead of strictly one at a time. This becomes painfully evident when evaluating deeply nested chunked sequences, like in the case of subsets. Chunked sequences were introduced in Clojure 1.1, and many core functions were upgraded to recognize and process them differently, including map. The main purpose of introducing them was to improve performance in certain stream-processing scenarios, but arguably they make it significantly harder to reason about the laziness characteristics of a program. You can read up on chunked sequences here and here. Also check out this question here.
The real problem is that range returns a chunked seq, and is used internally by subsets. The fix recommended by David James patches subsets to unchunk the sequence created by range internally.
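The unchunking trick itself is the same one-cons-at-a-time rebuild as the seq1 function shown earlier on this page; a common formulation:
(defn unchunk [s]
  (lazy-seq
    (when-let [s (seq s)]
      (cons (first s) (unchunk (rest s))))))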
This issue has been raised on the project's ticket tracker: Clojure JIRA: OutOfMemoryError with combinatorics/subsets. There, you can find a patch by Andy Fingerhut. It worked for me. Note that the patch is different than the mapcat variation suggested by another answer.
In the absence of command line arguments, the startup heap size parameters of a JVM are determined by various ergonomics.
The defaults (JDK 6) are:
initial heap size: memory / 64
maximum heap size: MIN(memory / 4, 1GB)
but you can force an absolute value using the -Xmx and -Xms args
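For example, since the question uses Leiningen, the heap limits can be set via :jvm-opts in project.clj (the values below are only illustrative):
(defproject my-project "0.1.0-SNAPSHOT"
  ;; illustrative heap settings; adjust to your machine
  :jvm-opts ["-Xms512m" "-Xmx2g"])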
You can find more detail here

Clojure flatten and laziness

I'm not sure about the behaviour I observe while using flatten when constructing a lazy sequence.
Looking at the source in clojure.core, I can see that the flatten function makes a call to filter, and hence should return a lazy sequence (I think). And yet the following snippet gives me a StackOverflowError. When the call to flatten in the snippet is replaced with a call to concat, it works just fine:
(defn l-f [c]
  (if (nil? c) []
    (lazy-seq (flatten (cons [[:h :j] :a :B] (l-f (rest c)))))))
(take 10 (l-f (repeat 2))) is how I invoke it.
This is a rather contrived example. Also I am aware that flatten and concat will give me sequences where the nesting levels are different.
I am trying to figure out why flatten seems to break the laziness, even though my (limited) understanding of the code in clojure.core suggests otherwise.
Laziness only takes you so far - laziness just means that the sequence isn't fully realized at the time that it's created, but building one lazy sequence from another sometimes involves looking ahead a few values. In this case, the implementation of flatten doesn't play nicely with the recursive way in which you're calling it.
First, the flatten function calls tree-seq to do a depth-first traversal of the contents of the collection. In turn, tree-seq calls mapcat with the provided sequence, which delegates to apply, which realizes the first few items in the sequence to determine the arity of the function to invoke. Realizing the first few items in the sequence causes a recursive call to l-f, which calls flatten on the remaining arguments, and gets stuck in an infinite loop.
In this particular situation, there's no need to call flatten recursively, because any call after the first will have no effect. So your function can be fixed by separating out the generation of the lazy sequence from the flattening of it:
(defn l-f [c]
  (letfn [(l-f-seq [x] (if-let [s (seq x)]
                         (lazy-seq (cons [[:h :j] :a :B] (l-f-seq (rest s))))
                         []))]
    (flatten (l-f-seq c))))
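With the sequence generation separated out, the original call now terminates; note that flatten splices away all of the nesting:
(take 10 (l-f (repeat 2)))
;=> (:h :j :a :B :h :j :a :B :h :j)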

build set lazily in clojure

I've started to learn clojure but I'm having trouble wrapping my mind around certain concepts. For instance, what I want to do here is to take this function and convert it so that it calls get-origlabels lazily.
(defn get-all-origlabels []
  (set (flatten (map get-origlabels (range *song-count*)))))
My first attempt used recursion but blew up the stack (song-count is about 10,000). I couldn't figure out how to do it with tail recursion.
get-origlabels returns a set each time it is called, but values are often repeated between calls. What the get-origlabels function actually does is read a file (a different file for each value from 0 to song-count-1) and return the words stored in it in a set.
Any pointers would be greatly appreciated!
Thanks!
-Philip
You can get what you want by using mapcat. I believe putting it into an actual Clojure set is going to de-lazify it, as demonstrated by the fact that (take 10 (set (iterate inc 0))) tries to realize the whole set before taking 10.
(distinct (mapcat get-origlabels (range *song-count*)))
This will give you a lazy sequence. You can verify that by doing something like, starting with an infinite sequence:
(->> (iterate inc 0)
     (mapcat vector)
     (distinct)
     (take 10))
You'll end up with a seq, rather than a set, but since it sounds like you really want laziness here, I think that's for the best.
This may have better stack behavior:
(require 'clojure.set)
(defn get-all-origlabels []
  (reduce (fn [s x] (clojure.set/union s (get-origlabels x)))
          #{}
          (range *song-count*)))
I'd probably use something like:
(into #{} (mapcat get-origlabels (range *song-count*)))
In general, "into" is very helpful when constructing Clojure data structures. I have this mental image of a conveyor belt (a sequence) dropping a bunch of random objects into a large bucket (the target collection).

How to convert lazy sequence to non-lazy in Clojure

I tried the following in Clojure, expecting to have the class of a non-lazy sequence returned:
(.getClass (doall (take 3 (repeatedly rand))))
However, this still returns clojure.lang.LazySeq. My guess is that doall does evaluate the entire sequence, but returns the original sequence as it's still useful for memoization.
So what is the idiomatic means of creating a non-lazy sequence from a lazy one?
doall is all you need. Just because the seq has type LazySeq doesn't mean it has pending evaluation. Lazy seqs cache their results, so all you need to do is walk the lazy seq once (as doall does) in order to force it all, and thus render it non-lazy. seq does not force the entire collection to be evaluated.
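A quick REPL illustration of the caching; the dots print only on the first traversal:
(def s (map #(do (print \.) %) (range 3)))
(doall s)  ;; prints ... and returns (0 1 2)
(first s)  ;; prints nothing; the values are cached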
This is to some degree a question of taxonomy. A lazy sequence is just one type of sequence, as is a list, vector or map. So the answer is of course "it depends on what type of non-lazy sequence you want to get."
Take your pick from:
an ex-lazy (fully evaluated) lazy sequence (doall ... )
a list for sequential access (apply list (my-lazy-seq)) OR (into () ...)
a vector for later random access (vec (my-lazy-seq))
a map or a set if you have some special purpose.
You can have whatever type of sequence best suits your needs.
This Rich guy seems to know his Clojure and is absolutely right.
But I think this code snippet, using your example, might be a useful complement to this question:
=> (realized? (take 3 (repeatedly rand)))
false
=> (realized? (doall (take 3 (repeatedly rand))))
true
Indeed, the type has not changed, but the realization has.
I stumbled on this blog post about doall not being recursive. For that, I found that the first comment in the post did the trick. Something along the lines of:
(use 'clojure.walk)
(postwalk identity nested-lazy-thing)
I found this useful in a unit test where I wanted to force evaluation of some nested applications of map to force an error condition.
(.getClass (into '() (take 3 (repeatedly rand))))