Clojure flatten and laziness - clojure

I don't understand the behaviour I observe when using flatten while constructing a lazy sequence.
Looking at the source in clojure.core, I can see that flatten calls filter and hence should return a lazy sequence, I think. Yet the following snippet gives me a StackOverflowError. When the call to flatten is replaced with a call to concat, it works just fine:
(defn l-f [c]
  (if (nil? c) []
    (lazy-seq (flatten (cons [[:h :j] :a :B] (l-f (rest c)))))))
(take 10 (l-f (repeat 2))) is how I invoke it.
This is a rather contrived example. Also I am aware that flatten and concat will give me sequences where the nesting levels are different.
I am trying to figure out why flatten seems to break the laziness, even though my (limited) understanding of the code in clojure.core suggests otherwise.

Laziness only takes you so far - laziness just means that the sequence isn't fully realized at the time that it's created, but building one lazy sequence from another sometimes involves looking ahead a few values. In this case, the implementation of flatten doesn't play nicely with the recursive way in which you're calling it.
First, the flatten function calls tree-seq to do a depth-first traversal of the contents of the collection. In turn, tree-seq calls mapcat with the provided sequence, which delegates to apply, which realizes the first few items in the sequence to determine the arity of the function to invoke. Realizing the first few items in the sequence causes a recursive call to l-f, which calls flatten on the remaining arguments, which in turn realizes more items. Because (rest c) of an infinite sequence is never nil, this recursion never bottoms out, and the stack overflows.
In this particular situation, there's no need to call flatten recursively, because any call after the first will have no effect. So your function can be fixed by separating out the generation of the lazy sequence from the flattening of it:
(defn l-f [c]
  (letfn [(l-f-seq [x]
            (if-let [s (seq x)]
              (lazy-seq (cons [[:h :j] :a :B] (l-f-seq (rest s))))
              []))]
    (flatten (l-f-seq c))))
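As a quick check, here is a minimal sketch that repeats the definition above and uses the same invocation as in the question. The name first-ten is only an illustrative binding, not part of the original answer.

```clojure
;; Same definition as in the answer: build the lazy sequence first,
;; then flatten it exactly once at the top level.
(defn l-f [c]
  (letfn [(l-f-seq [x]
            (if-let [s (seq x)]
              (lazy-seq (cons [[:h :j] :a :B] (l-f-seq (rest s))))
              []))]
    (flatten (l-f-seq c))))

;; The invocation from the question, now on an infinite input without
;; overflowing the stack. first-ten is a hypothetical name for illustration.
(def first-ten (take 10 (l-f (repeat 2))))
;; first-ten => (:h :j :a :B :h :j :a :B :h :j)
```

Only the first few elements of the inner sequence are realized by flatten's look-ahead, so the infinite input is not a problem here.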

Related

What scope should calls to lazy-seq have?

I'm writing a lazy implementation of the Recamán's Sequence, and ran into some confusion regarding where calls to lazy-seq should happen.
The first version, which I came up with this morning, was:
(defn lazy-recamans-sequence []
  (let [f (fn rec [n seen last-s]
            (let [back (- last-s n)
                  new-s (if (and (pos? back) (not (seen back)))
                          back
                          (+ last-s n))]
              (lazy-seq ; Here
                (cons new-s (rec (inc n) (conj seen new-s) new-s)))))]
    (f 0 #{} 0)))
Then I realized that my placement of lazy-seq was kind of arbitrary, and that it could be placed higher to wrap more of the computations:
(defn lazy-recamans-sequence2 []
  (let [f (fn rec [n seen last-s]
            (lazy-seq ; Here
              (let [back (- last-s n)
                    new-s (if (and (pos? back) (not (seen back)))
                            back
                            (+ last-s n))]
                (cons new-s (rec (inc n) (conj seen new-s) new-s)))))]
    (f 0 #{} 0)))
Then I looked back on a review that someone gave me last night:
(defn recaman []
  (letfn [(tail [previous n seen]
            (let [nx (if (and (> previous n) (not (seen (- previous n))))
                       (- previous n)
                       (+ previous n))]
              ; Here, inside "cons"
              (cons nx (lazy-seq (tail nx (inc n) (conj seen nx))))))]
    (tail 0 0 #{})))
And they have theirs inside of the call to cons!
Thinking this over, it seems like it wouldn't make a difference. With a broader scope (like the second version), more code is inside the explicit function that's passed to LazySeq. With a narrower scope however, the function itself may be smaller, but since the passed function involves a recursive call, it will be executing the same code anyways.
They seem to perform nearly identically and give the same answers. Is there any reason to prefer placing lazy-seq in one place over another? Is this simply a stylistic choice, or can this have actual repercussions?
In the first two examples, lazy-seq wraps the cons call. This means that when you call the function, it returns an unrealized lazy sequence immediately.
In the first example, the let expression is still outside of lazy-seq, so the value of the first item is calculated immediately, but the returned sequence is still lazy and not realized.
In the second example, lazy-seq wraps the let block as well as the cons call. This means that the function returns immediately, and the value of the first item is calculated only when the caller starts to consume the lazy sequence.
In the third example, the value of the first item in the list is calculated immediately and only the tail of the returned sequence is lazy.
Is there any reason to prefer placing lazy-seq in one place over another?
It depends on what you want to achieve. Do you want to return a sequence immediately without calculating any values? In this case make the scope of lazy-seq as broad as possible. Otherwise try to restrict the scope of lazy-seq to calculate only the tail part of the sequence.
When I was first learning Clojure, I was a bit confused by the many possible choices of lazy-seq constructs, the lack of clarity in terms of which construct to choose, and the somewhat vague explanation for how lazy-seq creates laziness in the first place (it is implemented as a Java class of ~240 lines).
To reduce repetition and keep things as simple as possible, I created the lazy-cons macro. It is used like so:
(defn lazy-countdown [n]
  (when (<= 0 n)
    (lazy-cons n (lazy-countdown (dec n)))))
(deftest t-all
  (is= (lazy-countdown 5) [5 4 3 2 1 0])
  (is= (lazy-countdown 1) [1 0])
  (is= (lazy-countdown 0) [0])
  (is= (lazy-countdown -1) nil))
This version does realize the initial value n immediately.
I never worry about chunking (typically batches of 32) or trying to precisely control the number of elements realized in a lazy sequence. IMHO, if you need fine-grained control such as this, it is better to use an explicit loop than to make assumptions on the timing of realizations in a lazy sequence.

Implementation of lazy filter in clojure

On http://clojure.org/lazy, filter is defined this way:
(defn filter
  "Returns a lazy sequence of the items in coll for which
  (pred item) returns true. pred must be free of side-effects."
  [pred coll]
  (let [step (fn [p c]
               (when-let [s (seq c)]
                 (if (p (first s))
                   (cons (first s) (filter p (rest s)))
                   (recur p (rest s)))))]
    (lazy-seq (step pred coll))))
Is it important that the recursive call is to filter, not to step? If it is, why?
It is with the rest of the code as given here, because it is filter which does the wrapping in lazy-seq. If step called itself, it would do all the filtering at once instead of lazily.
(Updated.) If lazy-seq was added to step's body, step could then call itself and still be lazy. This could be accomplished at least in these two ways:
1. by wrapping the entire body in lazy-seq and replacing both the recursive call to filter and the recur with calls to step; NB. in this case the step function would need to be named (either by replacing the let with letfn, with the appropriate change in syntax, or by adding a name to the fn form: (fn step ...)); the lazy-seq wrapping the outermost call to step would then be unnecessary; also, at this point you could simply not have an inner helper function (just use this approach in filter directly);
2. by leaving the lazy-seq in filter in place and wrapping the recursive call in step (which would now be to step itself) in lazy-seq (with the recur form remaining unchanged).
Note that clojure.core/filter has a different implementation, with separate logic handling chunked sequences and no internal helper functions. In the non-chunked case it operates like the version of step described in 1. above.
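The two variants described in the list above can be sketched as follows. The names filter-1 and filter-2 are hypothetical, chosen only to avoid shadowing clojure.core/filter; this illustrates the two placements of lazy-seq and is not the actual clojure.core implementation.

```clojure
;; Variant 1: the whole body of step is wrapped in lazy-seq, and both the
;; recursive call and the former recur become plain calls to step.
(defn filter-1 [pred coll]
  (letfn [(step [p c]
            (lazy-seq
              (when-let [s (seq c)]
                (if (p (first s))
                  (cons (first s) (step p (rest s)))
                  (step p (rest s))))))]
    (step pred coll)))

;; Variant 2: the outer lazy-seq stays in place; only the recursive call in
;; the cons position is wrapped in lazy-seq, and recur is left unchanged.
(defn filter-2 [pred coll]
  (let [step (fn step [p c]
               (when-let [s (seq c)]
                 (if (p (first s))
                   (cons (first s) (lazy-seq (step p (rest s))))
                   (recur p (rest s)))))]
    (lazy-seq (step pred coll))))

;; Both stay lazy, so they can be used on an infinite input.
(def odds-1 (take 3 (filter-1 odd? (range))))
(def odds-2 (take 3 (filter-2 odd? (range))))
;; odds-1 => (1 3 5), odds-2 => (1 3 5)
```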

lazy-seq -- cons outside or in

Should cons be inside (lazy-seq ...)
(def lseq-in (lazy-seq (cons 1 (more-one))))
or out?
(def lseq-out (cons 1 (lazy-seq (more-one))))
I noticed
(realized? lseq-in)
;;; ⇒ false
(realized? lseq-out)
;;; ⇒ <err>
;;; ClassCastException clojure.lang.Cons cannot be cast to clojure.lang.IPending clojure.core/realized? (core.clj:6773)
All the examples on the clojuredocs.org use "out".
What are the tradeoffs involved?
You definitely want (lazy-seq (cons ...)) as your default, deviating only if you have a clear reason for it. clojuredocs.org is fine, but the examples are all community-provided and I would not call them "the docs". Of course, a consequence of how it's built is that the examples tend to get written by people who just learned how to use the construct in question and want to help out, so many of them are poor. I would refer instead to the code in clojure.core, or other known-good code.
Why should this be the default? Consider these two implementations of map:
(defn map1 [f coll]
  (when-let [s (seq coll)]
    (cons (f (first s))
          (lazy-seq (map1 f (rest coll))))))
(defn map2 [f coll]
  (lazy-seq
    (when-let [s (seq coll)]
      (cons (f (first s))
            (map2 f (rest coll))))))
If you call (map1 prn xs), then an element of xs will be realized and printed immediately, even if you never intentionally realize an element of the resulting mapped sequence. map2, on the other hand, immediately returns a lazy sequence, delaying all its work until an element is requested.
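One way to observe this difference without relying on printed output is to count calls to the mapped function. The sketch below repeats map1 and map2 from above; the atom calls, the helper counting-inc, and the other def names are hypothetical additions for illustration.

```clojure
(defn map1 [f coll]
  (when-let [s (seq coll)]
    (cons (f (first s))
          (lazy-seq (map1 f (rest s))))))

(defn map2 [f coll]
  (lazy-seq
    (when-let [s (seq coll)]
      (cons (f (first s))
            (map2 f (rest s))))))

;; Hypothetical helper: counts how often the mapped function has run.
(def calls (atom 0))
(defn counting-inc [x]
  (swap! calls inc)
  (inc x))

;; map1 applies f to the first element as soon as it is called.
(def r1 (map1 counting-inc [1 2 3]))
(def calls-after-map1 @calls) ; => 1

;; map2 defers everything until an element is requested.
(reset! calls 0)
(def r2 (map2 counting-inc [1 2 3]))
(def calls-after-map2 @calls) ; => 0
```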
With cons inside lazy-seq, the evaluation of the expression for the first element of your seq gets deferred; with cons on the outside, it's done right away and only the construction of the "rest" part of the seq is deferred. (So (rest lseq-out) will be a lazy seq.)
Thus, if computing the first element is expensive and it might not be needed at all, putting cons inside lazy-seq makes more sense. If the initial element is supplied to the lazy seq producer as an argument, it may make more sense to use cons on the outside (this is the case with clojure.core/iterate). Otherwise it doesn't make that much of a difference. (The overhead of creating a lazy seq object at the start is negligible.)
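As an illustration of the case where the initial element is supplied as an argument, here is a minimal iterate-like sketch with cons on the outside. The name my-iterate is hypothetical, and this is not claimed to match the current clojure.core/iterate source.

```clojure
;; x is already known when the function is called, so there is nothing to
;; defer for the head; only the rest of the sequence is wrapped in lazy-seq.
(defn my-iterate [f x]
  (cons x (lazy-seq (my-iterate f (f x)))))

(def first-four (take 4 (my-iterate inc 0)))
;; first-four => (0 1 2 3)
```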
Clojure itself uses both approaches (although in the majority of cases lazy-seq wraps the whole seq-producing expression, which may not necessarily start with cons).

In Clojure, are lazy seqs always chunked?

I was under the impression that the lazy seqs were always chunked.
=> (take 1 (map #(do (print \.) %) (range)))
(................................0)
As expected 32 dots are printed because the lazy seq returned by range is chunked into 32 element chunks. However, when instead of range I try this with my own function get-rss-feeds, the lazy seq is no longer chunked:
=> (take 1 (map #(do (print \.) %) (get-rss-feeds r)))
(."http://wholehealthsource.blogspot.com/feeds/posts/default")
Only one dot is printed, so I guess the lazy-seq returned by get-rss-feeds is not chunked. Indeed:
=> (chunked-seq? (seq (range)))
true
=> (chunked-seq? (seq (get-rss-feeds r)))
false
Here is the source for get-rss-feeds:
(defn get-rss-feeds
  "returns a lazy seq of urls of all feeds; takes an html-resource from the enlive library"
  [hr]
  (map #(:href (:attrs %))
       (filter #(rss-feed? (:type (:attrs %))) (html/select hr [:link]))))
So it appears that chunkiness depends on how the lazy seq is produced. I peeked at the source for the function range and there are hints of it being implemented in a "chunky" manner. So I'm a bit confused as to how this works. Can someone please clarify?
Here's why I need to know.
I have the following code: (get-rss-entry (get-rss-feeds h-res) url)
The call to get-rss-feeds returns a lazy sequence of URLs of feeds that I need to examine.
The call to get-rss-entry looks for a particular entry (whose :link field matches the second argument of get-rss-entry). It examines the lazy sequence returned by get-rss-feeds. Evaluating each item requires an http request across the network to fetch a new rss feed. To minimize the number of http requests it's important to examine the sequence one-by-one and stop as soon as there is a match.
Here is the code:
(defn get-rss-entry
  [feeds url]
  (ffirst (drop-while empty? (map #(entry-with-url % url) feeds))))
entry-with-url returns a lazy sequence of matches or an empty sequence if there is no match.
I tested this and it seems to work correctly (evaluating one feed url at a time). But I am worried that somewhere, somehow it will start behaving in a "chunky" way and it will start evaluating 32 feeds at a time. I know there is a way to avoid chunky behavior as discussed here, but it doesn't seem to even be required in this case.
Am I using lazy seq non-idiomatically? Would loop/recur be a better option?
You are right to be concerned. Your get-rss-entry will indeed call entry-with-url more than strictly necessary if the feeds parameter is a collection that returns chunked seqs. For example if feeds is a vector, map will operate on whole chunks at a time.
This problem is addressed directly in Fogus' Joy of Clojure, with the function seq1 defined in chapter 12:
(defn seq1 [s]
  (lazy-seq
    (when-let [[x] (seq s)]
      (cons x (seq1 (rest s))))))
You could use this right where you know you want the most laziness possible, right before you call entry-with-url:
(defn get-rss-entry
  [feeds url]
  (ffirst (drop-while empty? (map #(entry-with-url % url) (seq1 feeds)))))
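The effect of seq1 can be checked by counting how often the mapped function runs. In the sketch below, the vector feeds and the two counters are hypothetical stand-ins for the real feed URLs and the HTTP requests; seq1 is the definition given above.

```clojure
(defn seq1 [s]
  (lazy-seq
    (when-let [[x] (seq s)]
      (cons x (seq1 (rest s))))))

;; Hypothetical stand-in for a collection of feeds; a vector yields chunked seqs.
(def feeds (vec (range 100)))

(def chunked-calls (atom 0))
(def unchunked-calls (atom 0))

;; map over a chunked seq processes a whole chunk to produce the first element.
(def a (first (map (fn [x] (swap! chunked-calls inc) x) feeds)))

;; With seq1 in between, map sees one element at a time.
(def b (first (map (fn [x] (swap! unchunked-calls inc) x) (seq1 feeds))))

;; @chunked-calls => 32, @unchunked-calls => 1
```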
Lazy seqs are not always chunked - it depends on how they are produced.
For example, the lazy seq produced by this function is not chunked:
(defn integers-from [n]
(lazy-seq (cons n (do (print \.) (integers-from (inc n))))))
(take 3 (integers-from 3))
=> (..3 .4 5)
But many other Clojure built-in functions do produce chunked seqs for performance reasons (e.g. range).
Relying on the vague behaviour of chunking seems unwise, as you mention above. Explicitly "un-chunking" in cases where you really need the sequence not to be chunked is also wise, because then things won't break if your code later changes in a way that makes it chunked. On another note, if you need actions to be sequential, agents are a great tool: you could send the download functions to an agent, and they will then be run one at a time and only once, regardless of how you evaluate the function. At some point you may want to pmap your sequence, and then even un-chunking will not work, though using an atom will continue to work correctly.
I have discussed this recently in Can I un-chunk lazy sequences to realize one element at a time? and the conclusion is that if you need to control when items are produced/consumed, you should not use lazy sequences.
For processing you can use transducers, where you control when the next item is processed.
For producing the elements, the ideal approach is to reify ISeq. A practical approach is to use lazy-seq with a single cons call in it whose rest is a recursive call. But notice that this relies on an implementation detail of lazy-seq.
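The transducer approach mentioned above can be sketched as follows, assuming that the goal is to stop at the first non-empty result. The function expensive-lookup and the counter calls are hypothetical stand-ins for a step such as fetching a feed; they are not part of the original question's code.

```clojure
(def calls (atom 0))

;; Hypothetical stand-in for an expensive step such as an HTTP request:
;; returns a non-empty collection only for inputs >= 3.
(defn expensive-lookup [x]
  (swap! calls inc)
  (when (>= x 3) [x]))

;; The transducer pipeline handles one input at a time and stops as soon as
;; (take 1) is satisfied, even though the source vector yields chunked seqs.
(def first-match
  (first (into []
               (comp (map expensive-lookup)
                     (remove empty?)
                     (take 1))
               (vec (range 100)))))

;; first-match => [3], and expensive-lookup ran for inputs 0, 1, 2 and 3 only.
```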

More efficient split-with in clojure

Clojure's split-with function is quite handy, but has to traverse the leading part of the seq twice, as it is literally implemented as [(take-while pred coll) (drop-while pred coll)]. Still, it is fairly easy to write a (tail-recursive) version that traverses the leading part only once (put the leading part in an accumulating vector, etc.).
However, I would like to extract the first element of a list that satisfies a predicate and return both the element and the remaining list (i.e. (concat (take-while pred coll) (next (drop-while pred coll)))), hopefully in a single pass. If I were using some imperative language, I would just traverse the list, holding onto the last cell, and, once I get the element to pop out, fiddle with the "next pointer" of the previous cell to reconstruct the modified list, but this seems out of the question in a functional language.
So is there a way to do that efficiently in Clojure?
For split-with (and similar tasks where you want to produce two outputs from one input), you can have any two of:
- Laziness
- Immutability
- Perfect efficiency
For example, if you don't want laziness (of the first "dropped" portion), you can get the other two by implementing a tail-recursive version as you suggest.
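A tail-recursive, single-pass version along those lines might look like the sketch below. The name split-with-eager is hypothetical; the leading part is accumulated eagerly in a vector, so laziness of that part is given up.

```clojure
;; Walks the input once, collecting the leading elements that satisfy pred
;; into a vector and returning the untouched remainder alongside it.
(defn split-with-eager [pred coll]
  (loop [acc [] s (seq coll)]
    (if (and s (pred (first s)))
      (recur (conj acc (first s)) (next s))
      [acc s])))

(def parts (split-with-eager even? [2 4 5 6]))
;; parts => [[2 4] (5 6)]
```

When every element satisfies pred, the second component is nil rather than an empty sequence, which differs slightly from clojure.core/split-with.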
All this is not really applicable to your current question, since you only want one output sequence, and I recommend kotarak's solution (or something else like it). However, I thought you might like an explanation for why Clojure's built-in split-with traverses the input sequence twice.
You can always drop down to lazy-seq for special requirements.
(defn splice-tail
  ([pred coll] (splice-tail pred 1 coll))
  ([pred n coll]
   (lazy-seq
     (when-let [s (seq coll)]
       (let [fst (first s)]
         (if (pred fst)
           (cons fst (splice-tail pred n (rest s)))
           (nthnext s n)))))))
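A short usage sketch for the function above, with made-up sample data: elements are kept while pred holds, and at the first element that fails pred, n elements (one by default) are dropped and the rest is returned unchanged.

```clojure
(defn splice-tail
  ([pred coll] (splice-tail pred 1 coll))
  ([pred n coll]
   (lazy-seq
     (when-let [s (seq coll)]
       (let [fst (first s)]
         (if (pred fst)
           (cons fst (splice-tail pred n (rest s)))
           (nthnext s n)))))))

;; 5 is the first element failing even?, so it is removed in a single pass.
(def without-first-odd (splice-tail even? [2 4 5 6 7]))
;; without-first-odd => (2 4 6 7)

;; With n = 2, the failing element and the one after it are dropped.
(def without-two (splice-tail even? 2 [2 4 5 6 7]))
;; without-two => (2 4 7)
```

For n = 1 the result matches (concat (take-while pred coll) (next (drop-while pred coll))) from the question.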