Clojure where is left fold - clojure

Does Clojure implement a left fold or a right fold?
I understand there is a new library, reducers, which has this, but shouldn't it exist in clojure.core?

Clojure implements a left fold called reduce.
Why no right fold?
reduce and many other functions work on sequences, which are accessible from the left but not the right.
The new reducers and transducers are designed to work with associative functions on data structures of varying accessibility.
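For reference, here is what the left fold looks like in practice (plain clojure.core):
(reduce conj [] '(1 2 3))  ; => [1 2 3], elements consumed left to right
(reduce - 100 [1 2 3])     ; => 94, i.e. (- (- (- 100 1) 2) 3)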

As Thumbnail points out, reduce-right cannot be efficiently implemented on the JVM for sequences. But as it turns out, we do have a family of data types that supports efficient lookup and truncation from the right side: reduce-right can be implemented for vectors.
user> (defn reduce-right
        [f init vec]
        (loop [acc init, v vec]
          (if (empty? v)
            acc
            (recur (f acc (peek v)) (pop v)))))
#'user/reduce-right
user> (count (str (reduce-right *' 1 (into [] (range 1 100000))))) ; digit count
456569

How do Clojure transducers work under the hood?

Reading all over (except the Clojure source code), it is somewhat hard to fathom how transducers avoid the use of intermediate collections, which is what is supposed to make them leaner and more performant.
A related question is whether they assume that each input transformation is applied to each element of its input independently of the other elements, a limitation that might exist if transducers worked by squashing the input transformations onto the input collection element by element.
Do they inspect the code of their input functions to determine how to interweave them such that they yield the correct result of their composition?
Can you please detail how transducers in Clojure work under the hood, in these regards?
A related question is whether they assume that each input transformation is applied to each element of its input independently of the other elements
They are named transducers because they may have (implicit) state.
A transducer is a function that takes a reducing function and returns a reducing function.
A reducing function is a function that takes two parameters, an accumulator and an item, and returns an updated accumulator.
This is the reducing function which holds the mutable state (if any).
To understand transducers you have to see that they work in two phases: composition time, then computation time. That's why they are functions returning functions.
Let's start with an easy reducing function: conj.
The transducer returned by (map inc) is (fn [rf] (fn [acc x] (rf acc (inc x)))). When called with conj it returns a function equivalent to (fn [acc x] (conj acc (inc x))).
The transducer returned by (filter odd?) is (fn [rf] (fn [acc x] (if (odd? x) (rf acc x) acc))). When called with conj it returns a function equivalent to (fn [acc x] (if (odd? x) (conj acc x) acc)). This one is interesting because the call to rf (the downstream reducing function) is sometimes short-circuited.
If you want to chain these two transducers you just do (comp (map inc) (filter odd?)). If you pass conj to this composite transducer, (filter odd?) is the first to wrap conj (because comp applies functions from right to left). The resulting filtered-rf function is then passed to (map inc), which yields a function equivalent to:
(fn [acc x] (filtered-rf acc (inc x))) where filtered-rf is (fn [acc x] (if (odd? x) (conj acc x) acc)). If you inline filtered-rf you get: (fn [acc x] (let [x+1 (inc x)] (if (odd? x+1) (conj acc x+1) acc))).
As you may see no intermediate collection or sequence is allocated.
For stateful transducers it's the same story, except that the reducing functions hold mutable state (as little as possible, and they avoid keeping all previous items in it): usually a volatile box (see volatile!) or a mutable Java object.
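To make that statefulness concrete, here is a minimal sketch (an illustration only, not clojure.core/dedupe) of a dedupe-like transducer that remembers just the previous item in a volatile box, showing only the two-argument arity discussed here:
(defn dedupe-consecutive []               ; illustrative name, not from clojure.core
  (fn [rf]
    (let [prev (volatile! ::none)]        ; mutable state lives inside the reducing fn
      (fn [acc x]
        (let [p @prev]
          (vreset! prev x)
          (if (= p x)
            acc                           ; duplicate: short-circuit, don't call rf
            (rf acc x)))))))
(reduce ((dedupe-consecutive) conj) [] [1 1 2 2 3 1]) ; => [1 2 3 1]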
You may also have noticed that in the example items are first mapped, then filtered: computations are applied from left to right, which seems to contradict comp. It does not: remember that comp here composes transducers, fns that wrap reducing fns. So at composition time the wrapping occurs right to left (conj wrapped by the "filtering rf", then by the "mapping rf"), but at computation time the wrapping layers are traversed from the outside in: map, then filter, then conj.
There are finicky implementation details to know when implementing your own transducers (reduced, the init and completion arities), but the general idea is the one described above.
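As a quick REPL check of the composition described above, using transduce from clojure.core (xf is just a local name for the composite transducer):
(def xf (comp (map inc) (filter odd?)))
(transduce xf conj [] [1 2 3 4 5]) ; => [3 5], incremented first, then filtered, no intermediate seq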

Why is fold and reduce considered fundamental - surely everything is defined in terms of cons and car?

We can see that we can use reduce/foldl as the function by which we can define other higher-order functions such as map, filter and mapcat.
(defn mapl [f coll]
  (reduce (fn [r x] (conj r (f x)))
          [] coll))
(defn filterl [pred coll]
  (reduce (fn [r x] (if (pred x) (conj r x) r))
          [] coll))
(defn mapcatl [f coll]
  (reduce (fn [r x] (reduce conj r (f x)))
          [] coll))
We also appear to be able to do this in terms of foldr. Here are map and filter in terms of foldr, from Rich Hickey's Transducers talk at 17:25.
(defn mapr [f coll]
  (foldr (fn [x r] (cons (f x) r))
         () coll))
(defn filterr [pred coll]
  (foldr (fn [x r] (if (pred x) (cons x r) r))
         () coll))
Now we can define map, foldl (reduce) and foldr in terms of first, rest and cons (car, cdr and cons):
(defn foldr [f z xs]
  (if (empty? xs)
    z
    (f (first xs) (foldr f z (rest xs)))))
(defn foldl [f z xs]
  (if (empty? xs)
    z
    (foldl f (f z (first xs)) (rest xs))))
(defn map [f lst]
  (if (empty? lst)
    '()
    (cons (f (first lst)) (map f (rest lst)))))
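A quick check that these definitions behave as expected (using the foldr-based mapr and filterr from above):
(mapr inc [1 2 3])        ; => (2 3 4)
(filterr odd? [1 2 3 4])  ; => (1 3)
(foldl + 0 [1 2 3 4])     ; => 10
(foldr cons () [1 2 3])   ; => (1 2 3)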
My question is: why are fold and reduce considered fundamental, when surely everything is defined in terms of cons, car and cdr? Isn't this looking at things at the wrong level?
I thought Rich Hickey explained exactly that in his talk about transducers.
He sees folds as a fundamental concept of transformation. It doesn't need to know what structure it is working on and how to operate on that structure.
You just defined fold in terms of itself, first and rest (car and cdr). What Rich is arguing for is that fold is in itself an abstract concept, separate from the data structure it operates on, and as long as we provide it certain functions that actually operate on the data structure, it will work as expected.
So in the end it's all about separation of concerns and reusability.
Folds are a unifying framework for processing data in a principled way, because they correspond to inductive data definitions. They are not ad hoc. cons/car/cdr are building blocks for creating data (and code), but in an unprincipled way (we can do anything, and process it ad hoc). Folds are higher level, more disciplined, more predictable, easier to reason about.
Given the ad-hoc implementations of map, filter and mapcat for lists, we can see there is something similar in them: the structure of the code follows the structure of the data (a list, built of cons nodes, to which the two arguments of the combining function correspond). And that's fold.
In that talk, Rich Hickey abstracts away the step function. But he doesn't abstract over the data. Any combining function used to fold over labeled trees has to take three parameters, not two. So what his -ing functions do is, still, folding on lists, conceptually; it's just that they abstract over the concrete implementations of lists - either as cons cells, hashed array trees, whatever.
Most operators in a lispy language will have a partner that you can use to define it or vice versa. This attribute is called "metacircularity". There is no one specific set of basic operators that is the fundamental minimum to allow the full set of programmability given by the language.
You can read more at this paper:
http://home.pipeline.com/~hbaker1/MetaCircular.html

Lazy partition-by

I have a source of items and want to separately process runs of them having the same value of a key function. In Python this would look like
for key_val, part in itertools.groupby(src, key_fn):
    process(key_val, part)
This solution is completely lazy, i.e. if process doesn't try to store the contents of the entire part, the code runs in O(1) memory.
Clojure solution
(doseq [part (partition-by key-fn src)]
  (process part))
is less lazy: it realizes each part completely. The problem is, src might have very long runs of items with the same key-fn value and realizing them might lead to OOM.
I've found this discussion where it's claimed that the following function (slightly modified for naming consistency inside post) is lazy enough
(defn lazy-partition-by [key-fn coll]
  (lazy-seq
    (when-let [s (seq coll)]
      (let [fst (first s)
            fv (key-fn fst)
            part (lazy-seq (cons fst (take-while #(= fv (key-fn %)) (next s))))]
        (cons part (lazy-partition-by key-fn (drop-while #(= fv (key-fn %)) s)))))))
However, I don't understand why it doesn't suffer from OOM: both parts of the cons cell hold a reference to s, so while process consumes part, s is being realized but not garbage collected. It would become eligible for GC only when drop-while traverses part.
So, my questions are:
Am I correct about lazy-partition-by not being lazy enough?
Is there an implementation of partition-by with guaranteed memory requirements, provided I don't hold any references to the previous part by the time I start realizing the next one?
EDIT:
Here's a lazy enough implementation in Haskell:
lazyPartitionBy :: Eq b => (a -> b) -> [a] -> [[a]]
lazyPartitionBy _ [] = []
lazyPartitionBy keyFn xl@(x:_) =
  let fv = keyFn x
      (part, rest) = span ((== fv) . keyFn) xl
  in part : lazyPartitionBy keyFn rest
As can be seen from span implementation, part and rest implicitly share state. I wonder if this method could be translated into Clojure.
The rule of thumb that I use in these scenarios (ie, those in which you want a single input sequence to produce multiple output sequences) is that, of the following three desirable properties, you can generally have only two:
Efficiency (traverse the input sequence only once, thus not hold its head)
Laziness (produce elements only on demand)
No shared mutable state
The version in clojure.core chooses (1,3), but gives up on (2) by producing an entire partition all at once. Python and Haskell both choose (1,2), although it's not immediately obvious: doesn't Haskell have no mutable state at all? Well, its lazy evaluation of everything (not just sequences) means that all expressions are thunks, which start out as blank slates and only get written to when their value is needed; the implementation of span, as you say, shares the same thunk of span p xs' in both of its output sequences, so that whichever one needs it first "sends" it to the result of the other sequence, effecting the action at a distance that's necessary to preserve the other nice properties.
The alternative Clojure implementation you linked to chooses (2,3), as you noted.
The problem is that for partition-by, declining either (1) or (2) means that you're holding the head of some sequence: either the input or one of the outputs. So if you want a solution where it's possible to handle arbitrarily large partitions of an arbitrarily large input, you need to choose (1,2). There are a few ways you could do this in Clojure:
Take the Python approach: return something more like an iterator than a seq - seqs make stronger guarantees about non-mutation, and promise that you can safely traverse them multiple times, etc etc. If instead of a seq of seqs you return an iterator of iterators, then consuming items from any one iterator can freely mutate or invalidate the other(s). This guarantees consumption happens in order and that memory can be freed up.
Take the Haskell approach: manually thunk everything, with lots of calls to delay, and require the client to call force as often as necessary to get data out. This will be a lot uglier in Clojure, and will greatly increase your stack depth (using this on a non-trivial input will probably blow the stack), but it is theoretically possible.
Write something more Clojure-flavored (but still quite unusual) by having a few mutable data objects that are coordinated between the output seqs, each updated as needed when something is requested from any of them.
I'm pretty sure any of these three approaches is possible, but to be honest they're all pretty hard and not at all natural. Clojure's sequence abstraction just doesn't make it easy to produce the data structure you'd like. My advice is that if you need something like this and the partitions may be too large to fit comfortably in memory, you just accept a slightly different format and do a little more bookkeeping yourself: dodge the (1,2,3) dilemma by not producing multiple output sequences at all!
So instead of ((2 4 6 8) (1 3 5) (10 12) (7)) being your output format for something like (partition-by even? [2 4 6 8 1 3 5 10 12 7]), you could accept a slightly uglier format: ([::key true] 2 4 6 8 [::key false] 1 3 5 [::key true] 10 12 [::key false] 7). This is neither hard to produce nor hard to consume, although it is a bit lengthy and tedious to write out.
Here is one reasonable implementation of the producing function:
(defn lazy-partition-by [f coll]
  (lazy-seq
    (when (seq coll)
      (let [x (first coll)
            k (f x)]
        (list* [::key k] x
               ((fn part [k xs]
                  (lazy-seq
                    (when (seq xs)
                      (let [x (first xs)
                            k' (f x)]
                        (if (= k k')
                          (cons x (part k (rest xs)))
                          (list* [::key k'] x (part k' (rest xs))))))))
                k (rest coll)))))))
And here's how to consume it, first defining a generic reduce-grouped that hides the details of the grouping format, and then an example function count-partition-sizes to output the key and size of each partition without keeping any sequences in memory:
(defn reduce-grouped [f init groups]
  (loop [k nil, acc init, coll groups]
    (if (empty? coll)
      acc
      (if (and (coll? (first coll)) (= ::key (ffirst coll)))
        (recur (second (first coll)) acc (rest coll))
        (recur k (f k acc (first coll)) (rest coll))))))
(defn count-partition-sizes [f coll]
  (reduce-grouped (fn [k acc _]
                    (if (and (seq acc) (= k (first (peek acc))))
                      (conj (pop acc) (update-in (peek acc) [1] inc))
                      (conj acc [k 1])))
                  [] (lazy-partition-by f coll)))
user> (lazy-partition-by even? [2 4 6 8 1 3 5 10 12 7])
([:user/key true] 2 4 6 8 [:user/key false] 1 3 5 [:user/key true] 10 12 [:user/key false] 7)
user> (count-partition-sizes even? [2 4 6 8 1 3 5 10 12 7])
[[true 4] [false 3] [true 2] [false 1]]
Edit: Looking at it again, I'm not really convinced my reduce-grouped is much more useful than (reduce f init (map g xs)), since it doesn't really give you any clear indication of when the key changes. So if you do need to know when a group changes, you'll want a smarter abstraction, or to use my original lazy-partition-by with nothing "clever" wrapping it.
Although this question evokes very interesting contemplation about language design, the practical problem is that you want to process partitions in constant memory. And the practical problem is resolvable with a little inversion.
Rather than processing over the result of a function that returns a sequence of partitions, pass the processing function into the function that produces the partitions. Then, you can control state in a contained manner.
First we'll provide a way to fuse together the consumption of the sequence with the state of the tail.
(defn fuse [coll wick]
  (lazy-seq
    (when-let [s (seq coll)]
      (swap! wick rest)
      (cons (first s) (fuse (rest s) wick)))))
Then a modified version of partition-by
(defn process-partition-by [processfn keyfn coll]
  (lazy-seq
    (when (seq coll)
      (let [tail (atom (cons nil coll))
            s (fuse coll tail)
            fst (first s)
            fv (keyfn fst)
            pred #(= fv (keyfn %))
            part (take-while pred s)
            more (lazy-seq (drop-while pred @tail))]
        (cons (processfn part)
              (process-partition-by processfn keyfn more))))))
Note: For O(1) memory consumption processfn must be an eager consumer! So while (process-partition-by identity key-fn coll) is the same as (partition-by key-fn coll), because identity does not consume the partition, the memory consumption is not constant.
Tests
(defn heavy-seq []
  ; adjust payload for your JVM so only a few fit in memory
  (let [payload (fn [] (long-array 20000000))]
    (map #(vector % (payload)) (iterate inc 0))))
(defn my-process [s] (reduce + (map first s)))
(defn test1 []
  (doseq [part (partition-by #(quot (first %) 10) (take 50 (heavy-seq)))]
    (my-process part)))
(defn test2 []
  (process-partition-by
    my-process #(quot (first %) 20) (take 200 (heavy-seq))))
so.core=> (test1)
OutOfMemoryError Java heap space [trace missing]
so.core=> (test2)
(190 590 990 1390 1790 2190 2590 2990 3390 3790)
Am I correct about lazy-partition-by not being lazy enough?
Well, there's a difference between laziness and memory usage. A sequence can be lazy and still require lots of memory - see for instance the implementation of clojure.core/distinct, which uses a set to remember all the previously observed values in the sequence. But yes, your analysis of the memory requirements of lazy-partition-by is correct - the function call to compute the head of the second partition will retain the head of the first partition, which means that realizing the first partition causes it to be retained in-memory. This can be verified with the following code:
user> (doseq [part (lazy-partition-by :a
                                      (repeatedly
                                        (fn [] {:a 1 :b (long-array 10000000)})))]
        (dorun part))
; => OutOfMemoryError Java heap space
Since neither doseq nor dorun retains the head, this would simply run forever if lazy-partition-by were O(1) in memory.
Is there an implementation of partition-by with guaranteed memory requirements, provided I don't hold any references to the previous part by the time I start realizing the next one?
It would be very difficult, if not impossible, to write such an implementation in a purely functional manner that would work for the general case. Consider that a general lazy-partition-by implementation cannot make any assumptions about when (or if) a partition is realized. The only guaranteed correct way of finding the start of the second partition, short of introducing some nasty bit of statefulness to keep track of how much of the first partition has been realized, is to remember where the first partition began and scan forward when requested.
For the special case where you're processing records one at a time for side effects and want them grouped by key (as is implied by your use of doseq above), you might consider something along the lines of a loop/recur that maintains state and resets it when the key changes, as sketched below.
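Here is a minimal sketch of that idea (process-one, key-fn and src are placeholder names for this example): records are processed one at a time for side effects, and the only state carried across iterations is the current key, so no partition is ever held in memory.
(defn process-runs! [key-fn process-one src]
  (loop [s (seq src), current-key ::none]
    (when s
      (let [x (first s)
            k (key-fn x)]
        (when (not= k current-key)
          (println "new group:" k))  ; per-group setup goes here
        (process-one k x)            ; side effect on a single record
        (recur (next s) k)))))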

lazy-seq -- cons outside or in

Should cons be inside (lazy-seq ...)
(def lseq-in (lazy-seq (cons 1 (more-one))))
or out?
(def lseq-out (cons 1 (lazy-seq (more-one))))
I noticed
(realized? lseq-in)
;;; ⇒ false
(realized? lseq-out)
;;; ⇒ <err>
;;; ClassCastException clojure.lang.Cons cannot be cast to clojure.lang.IPending clojure.core/realized? (core.clj:6773)
All the examples on the clojuredocs.org use "out".
What are the tradeoffs involved?
You definitely want (lazy-seq (cons ...)) as your default, deviating only if you have a clear reason for it. clojuredocs.org is fine, but the examples are all community-provided and I would not call them "the docs". Of course, a consequence of how it's built is that the examples tend to get written by people who just learned how to use the construct in question and want to help out, so many of them are poor. I would refer instead to the code in clojure.core, or other known-good code.
Why should this be the default? Consider these two implementations of map:
(defn map1 [f coll]
  (when-let [s (seq coll)]
    (cons (f (first s))
          (lazy-seq (map1 f (rest coll))))))
(defn map2 [f coll]
  (lazy-seq
    (when-let [s (seq coll)]
      (cons (f (first s))
            (map2 f (rest coll))))))
If you call (map1 prn xs), then an element of xs will be realized and printed immediately, even if you never intentionally realize an element of the resulting mapped sequence. map2, on the other hand, immediately returns a lazy sequence, delaying all its work until an element is requested.
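For instance, a small REPL sketch with prn as f, so that realization is visible:
user> (def xs (map1 prn [1 2 3]))
1
#'user/xs   ; the first element was realized (and printed) immediately
user> (def ys (map2 prn [1 2 3]))
#'user/ys   ; nothing printed until ys is consumed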
With cons inside lazy-seq, the evaluation of the expression for the first element of your seq gets deferred; with cons on the outside, it's done right away and only the construction of the "rest" part of the seq is deferred. (So (rest lseq-out) will be a lazy seq.)
Thus, if computing the first element is expensive and it might not be needed at all, putting cons inside lazy-seq makes more sense. If the initial element is supplied to the lazy seq producer as an argument, it may make more sense to use cons on the outside (this is the case with clojure.core/iterate). Otherwise it doesn't make that much of a difference. (The overhead of creating a lazy seq object at the start is negligible.)
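A sketch of that cons-on-the-outside pattern (my-iterate is only an illustration, not the actual clojure.core implementation): the first element x is already in hand, so there is nothing to defer about it; only the rest is lazy.
(defn my-iterate [f x]
  (cons x (lazy-seq (my-iterate f (f x)))))
(take 5 (my-iterate inc 0)) ; => (0 1 2 3 4)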
Clojure itself uses both approaches (although in the majority of cases lazy-seq wraps the whole seq-producing expression, which may not necessarily start with cons).

More efficient split-with in clojure

Clojure's split-with function is quite handy, but it has to traverse the leading part of the seq twice, as it is literally implemented as [(take-while pred coll) (drop-while pred coll)]. Still, it is fairly easy to write a (tail-recursive) version that traverses the leading part only once (accumulate the leading part into a vector, etc.).
However, I would like to extract the first element of a list that satisfies a predicate and return both the element and the remaining list (i.e. (concat (take-while pred coll) (next (drop-while pred coll)))), hopefully in a single pass. If I were using some imperative language, I would just traverse the list, holding onto the last cell, and, once I got the element to pop out, fiddle with the "next pointer" of the previous cell to reconstruct the modified list, but this seems out of the question in a functional language.
So is there a way to do that efficiently in Clojure?
For split-with (and similar tasks where you want to produce two outputs from one input), you can have any two of
Laziness
Immutability
Perfect efficiency.
For example, if you don't want laziness (of the first "dropped" portion), you can get the other two by implementing a tail-recursive version as you suggest.
All this is not really applicable to your current question, since you only want one output sequence, and I recommend kotarak's solution (or something else like it). However, I thought you might like an explanation for why Clojure's built-in split-with traverses the input sequence twice.
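For illustration, here is a minimal sketch of that eager, single-pass variant (split-with-eager is just a name for this example): it accumulates the leading part into a vector in one traversal and returns it together with the untraversed tail.
(defn split-with-eager [pred coll]
  (loop [acc [], s (seq coll)]
    (if (and s (pred (first s)))
      (recur (conj acc (first s)) (next s))
      [acc s])))
(split-with-eager odd? [1 3 5 2 4]) ; => [[1 3 5] (2 4)]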
You can always drop down to lazy-seq for special requirements.
(defn splice-tail
  ([pred coll] (splice-tail pred 1 coll))
  ([pred n coll]
   (lazy-seq
     (when-let [s (seq coll)]
       (let [fst (first s)]
         (if (pred fst)
           (cons fst (splice-tail pred n (rest s)))
           (nthnext s n)))))))
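For example, with the predicate playing the take-while role from the question's formula, the first element that fails it is spliced out:
(splice-tail even? [2 4 6 3 8 10]) ; => (2 4 6 8 10)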