I have noticed that lazy sequences in Clojure seem to be represented internally as linked lists (or at least they are treated as sequences with only sequential access to elements). Even after being cached in memory, access time over the lazy-seq with nth is O(n), not constant time as with vectors.
;; ...created my-lazy-seq here and used the first 50,000 items
(time (nth my-lazy-seq 10000))
"Elapsed time: 1.081325 msecs"
(time (nth my-lazy-seq 20000))
"Elapsed time: 2.554563 msecs"
How do I get constant-time lookups or create a lazy vector incrementally in Clojure?
Imagine that during generation of the lazy vector, each element is a function of all elements previous to it, so the time spent traversing the list becomes a significant factor.
Related questions only turned up this incomplete Java snippet:
Designing a lazy vector: problem with const
Yes, sequences in Clojure are described as "logical lists" with three operations (first, next and cons).
A sequence is essentially the Clojure version of an iterator (although clojure.org insists that sequences are not iterators, since they don't hold internal state), and can only move through the backing collection in a linear front-to-end fashion.
Lazy vectors do not exist, at least not in Clojure.
If you want constant time lookups over a range of indexes, without calculating intermediate elements you don't need, you can use a function that calculates the result on the fly. Combined with memoization (or caching the results in an arg-to-result hash on your own) you get pretty much the same effect as I assume you want from the lazy vector.
This obviously only works when there are algorithms that can compute f(n) more directly than going through all preceding f(0)...f(n-1). If there is no such algorithm, when the result for every element depends on the result for every previous element, you can't do better than the sequence iterator in any case.
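A minimal sketch of the "function plus memoization" idea, assuming a value that can be computed directly from its index. The function collatz-steps is a hypothetical example chosen for illustration (it is not from the question); each result depends only on n, so no preceding elements need to be computed first.

```clojure
;; Hypothetical example: the number of Collatz steps needed to reach 1 from n.
;; Each result depends only on n, not on the results for smaller indexes.
(defn collatz-steps [n]
  (loop [k n, steps 0]
    (cond
      (= k 1)   steps
      (even? k) (recur (quot k 2) (inc steps))
      :else     (recur (inc (* 3 k)) (inc steps)))))

;; memoize caches results per argument, so a repeated lookup is a hash lookup.
(def collatz-steps-memo (memoize collatz-steps))

(collatz-steps-memo 6) ;=> 8  (6 3 10 5 16 8 4 2 1)
(collatz-steps-memo 6) ;=> 8, served from the cache this time
```

Note that the cache built by memoize grows without bound, which may or may not be acceptable depending on the index range.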
Edit
BTW, if all you want is for the result to be a vector so you get quick lookups afterwards, and you don't mind that elements are created sequentially the first time, that's simple enough.
Here is a Fibonacci implementation using a vector:
(defn vector-fib [v]
  (let [a (v (- (count v) 2)) ; next-to-last element
        b (peek v)]           ; last element
    (conj v (+ a b))))
(def fib (iterate vector-fib [1 1]))
(first (drop 10 fib))
=> [1 1 2 3 5 8 13 21 34 55 89 144]
Here we are using a lazy sequence to postpone the function calls until asked for (iterate returns a lazy sequence), but the results are collected and returned in a vector.
The vector grows as needed, we add only the elements up to the last one asked for, and once computed it's a constant time lookup.
Was it something like this you had in mind?
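The same approach can be sketched for the case mentioned in the question, where each element is a function of all previous elements. The names grow-vector and f are hypothetical; f receives the whole vector built so far, and every lookup inside it is a vector lookup rather than a list traversal.

```clojure
;; Hypothetical helper: extend init until it has n elements,
;; computing each new element from the entire vector built so far.
(defn grow-vector [f init n]
  (loop [v init]
    (if (>= (count v) n)
      v
      (recur (conj v (f v))))))

;; Example: each element is the sum of all elements before it.
(grow-vector #(reduce + %) [1] 6)
;=> [1 1 2 4 8 16]
```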
Related
I am trying to implement the behavior of a stack in Clojure. Taking a cue from the implementation of frequencies I created a transient vector which I am conj!ing elements to (a la "push"). My issue is pop! removes elements from the end and a few other fns (rest,drop) only work on lazy sequences.
I know I could accomplish this using loop/recur (or reversing and pop!ing) but I want to better understand why removing from the beginning of a transient vector isn't allowed. I read this; is it because the implementation that allows them to be mutated is only O(1) when you're only editing nodes at the end, and changing the first node would require copying the entire vector?
Your difficulty is not with transients: pop and peek always work at the same end of a collection as conj does:
the end of a vector or
the head of a list.
So ...
(= (pop (conj coll x)) coll)
and
(= (peek (conj coll x)) x)
... are true for any x for any coll that implements IPersistentStack:
Vectors and lists do.
Conses and ranges don't.
If you want to look at your stack both ways, use a vector, and use the (constant time) rseq to reverse it. You'd have to leave transients, though, as there is no rseq!. Mind you, (comp rseq persistent!) is still constant time.
By the way, rest and drop work on any sequence, lazy or not:
=> (rest [1 2 3])
(2 3)
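To make the stack behaviour described above concrete, here is a small sketch using only clojure.core. The names stack and built are chosen for illustration.

```clojure
;; A vector used as a stack: conj, peek and pop all work at the end.
(def stack (-> [] (conj :a) (conj :b) (conj :c)))

(peek stack) ;=> :c
(pop stack)  ;=> [:a :b]
(rseq stack) ;=> (:c :b :a), a constant-time reversed view

;; A list used as a stack: the same three functions work at the head.
(peek (conj '(1 2) 0)) ;=> 0
(pop (conj '(1 2) 0))  ;=> (1 2)

;; Build with a transient, then make it persistent before calling rseq.
(def built (persistent! (reduce conj! (transient []) [:a :b :c])))
(rseq built) ;=> (:c :b :a)
```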
With the introduction of Spec, I try to write test.check generators for all of my functions. This is fine for simple data structures, but tends to become difficult with data structures that have parts that depend on each other. In other words, some state management within the generators is then required.
It would already help enormously to have generator-equivalents of Clojure's loop/recur or reduce, so that a value produced in one iteration can be stored in some aggregated value that is then accessible in a subsequent iteration.
One simple example where this would be required is a generator that splits a collection into exactly X partitions, with each partition having between zero and Y elements, and where the elements are then randomly assigned to any of the partitions. (Note that test.chuck's partition function does not allow specifying X or Y.)
If you write this generator by looping through the collection, then this would require access to the partitions filled up during previous iterations, to avoid exceeding Y.
Does anybody have any ideas? Partial solutions I have found:
test.check's let and bind allow you to generate a value and then reuse that value later on, but they do not allow iterations.
You can iterate through a collection of previously generated values with a combination of the tuple and bind functions, but these iterations do not have access to the values generated during previous iterations.
(defn bind-each [k coll] (apply tcg/tuple (map (fn [x] (tcg/bind (tcg/return x) k)) coll)))
You can use atoms (or volatiles) to store & access values generated during previous iterations. This works, but is very un-Clojure, in particular because you need to reset! the atom/volatile before the generator is returned, so that its contents are not reused in the next call of the generator.
Generators are monad-like due to their bind and return functions, which hints at the use of a monad library such as Cats in combination with a State monad. However, the State monad was removed in Cats 2.0 (because it was allegedly not a good fit for Clojure), while other support libraries I am aware of do not have formal Clojurescript support. Furthermore, when implementing a State monad in his own library, Jim Duey — one of Clojure's monad experts — seems to warn that the use of the State monad is not compatible with test.check's shrinking (see the bottom of http://www.clojure.net/2015/09/11/Extending-Generative-Testing/), which significantly reduces the merits of using test.check.
You can accomplish the iteration you're describing by combining gen/let (or equivalently gen/bind) with explicit recursion:
(defn make-foo-generator
  [state]
  (if (good-enough? state)
    (gen/return state)
    (gen/let [state' (gen-next-step state)]
      (make-foo-generator state'))))
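As one possible concrete instance of the recursive pattern above, here is a sketch for the question's example: exactly x partitions, each holding at most y elements. The name gen-capped-partition and its argument order are assumptions made for illustration, and the sketch requires the org.clojure/test.check dependency. The "state" threaded through the recursion is the vector of partitions filled so far.

```clojure
(require '[clojure.test.check.generators :as gen])

;; Hypothetical generator: assigns each element of coll to one of x partitions,
;; never letting a partition exceed y elements.
(defn gen-capped-partition
  [coll x y]
  {:pre [(<= (count coll) (* x y))]} ; otherwise the elements cannot all fit
  (letfn [(step [parts remaining]
            (if (empty? remaining)
              (gen/return parts)
              ;; indexes of the partitions that still have room
              (let [open (filterv #(< (count (parts %)) y) (range x))]
                (gen/bind (gen/elements open)
                          (fn [i]
                            (step (update parts i conj (first remaining))
                                  (rest remaining)))))))]
    (step (vec (repeat x [])) coll)))

;; One random result: a vector of 3 vectors, none longer than 4.
(gen/generate (gen-capped-partition (range 7) 3 4))
```

Because every element goes through gen/bind, this sketch trades shrinking quality for the ability to look at earlier choices.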
However, it's worth trying to avoid this pattern if possible, because each use of let/bind undermines the shrinking process. Sometimes it's possible to reorganize the generator using gen/fmap. For example, to partition a collection into a sequence of X subsets (which I realize is not exactly what your example was, but I think it could be tweaked to fit), you could do something like this:
(defn partition
  [coll subset-count]
  (gen/let [idxs (gen/vector (gen/choose 0 (dec subset-count))
                             (count coll))]
    (->> (map vector coll idxs)
         (group-by second)
         (sort-by key)
         (map (fn [[_ pairs]] (map first pairs))))))
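For comparison, the same idea can be sketched with gen/fmap directly, which avoids bind altogether. The name gen-partition is hypothetical, and this variant is an adaptation rather than the code above: it always returns exactly subset-count subsets, including empty ones, but it does not enforce any upper bound on subset size. It requires the org.clojure/test.check dependency.

```clojure
(require '[clojure.test.check.generators :as gen])

;; Hypothetical variant: generate one random subset index per element,
;; then build the subsets from those indexes in a pure function.
(defn gen-partition
  [coll subset-count]
  (gen/fmap
   (fn [idxs]
     (let [grouped (group-by second (map vector coll idxs))]
       ;; look up every index so that empty subsets are kept
       (mapv (fn [i] (mapv first (get grouped i [])))
             (range subset-count))))
   (gen/vector (gen/choose 0 (dec subset-count)) (count coll))))

;; One random result: a vector of 3 vectors covering 0..9.
(gen/generate (gen-partition (range 10) 3))
```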
When processing each element in a seq I normally use first and rest.
However these will cause a lazy-seq to lose its "laziness" by calling seq on the argument. My solution has been to use (first (take 1 coll)) and (drop 1 coll) in their place when working with lazy-seqs, and while I think drop 1 is just fine, I don't particularly like having to call first and take to get the first element.
Is there a more idiomatic way to do this?
The docstrings for first and rest say that these functions call seq on their arguments to convey the idea that you don't have to call seq yourself when passing in a seqable collection which is not in itself a seq, like, say, a vector or set. For example,
(first [1 2 3])
;= 1
would not work if first didn't call seq on its argument; you'd have to say
(first (seq [1 2 3]))
instead, which would be inconvenient.
Both take and drop also call seq on their arguments, otherwise you couldn't call them on vectors and the like as explained above. In fact this is true of all standard seq functions -- those which do not call seq directly are built upon lower-level components which do.
In no way does this impair the laziness of lazy seqs. The forcing / realization which happens as a result of a first / rest call is the smallest amount possible to obtain the requested result. (How much that is depends on the type of the argument; if it is not in fact lazy, there is no extra realization involved in the first call; if it is partly lazy -- that is, chunked -- there will be some extra realization (up to 32 initial elements will be computed at once); if it's fully lazy, only the first element will be computed.)
Clearly first, when passed a lazy seq, must force the realization of its first element -- that's the whole point. rest is actually somewhat lazy in that it doesn't force the realization of the "rest" part of the seq (that's in contrast to next, which is basically equivalent to (seq (rest ...))). The fact that it does force the first element to be realized so that it can skip over it immediately is a conscious design choice which avoids unnecessary layering of lazy seq objects and holding the head of the original seq; you could say something like (lazy-seq (rest xs)) to defer even this initial realization, at the cost of holding on to xs until the lazy seq wrapper is realized.
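The realization behaviour described above can be observed with a small experiment. This is only a sketch: the counter realized and the function counting-seq are hypothetical names, and the seq is built by hand with lazy-seq so that it is not chunked.

```clojure
;; Counts how many elements have been realized so far.
(def realized (atom 0))

;; A fully lazy (unchunked) seq n, n+1, n+2, ... that bumps the counter
;; each time one of its elements is realized.
(defn counting-seq [n]
  (lazy-seq
   (swap! realized inc)
   (cons n (counting-seq (inc n)))))

(def xs (counting-seq 0))
@realized  ;=> 0, nothing realized yet
(first xs) ;=> 0
@realized  ;=> 1, only the first element
(rest xs)  ; skips the first element without realizing the second
@realized  ;=> 1
(next xs)  ; like (seq (rest xs)), so it realizes the second element
@realized  ;=> 2

;; For a chunked source such as range, first realizes a whole chunk.
(reset! realized 0)
(first (map (fn [x] (swap! realized inc) x) (range 100)))
@realized  ; more than 1 (a whole chunk, typically 32)
```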
Here is the implementation of frequencies in clojure:
(defn frequencies
  "Returns a map from distinct items in coll to the number of times
  they appear."
  [coll]
  (persistent!
   (reduce (fn [counts x]
             (assoc! counts x (inc (get counts x 0))))
           (transient {}) coll)))
Is assoc! considered a mutation or not?
What is the complexity of assoc! inside frequencies?
Also it seems that counts is accessed twice in each iteration: does it cause a performance penalty?
assoc! is mutation of a transient; it is O(log n) amortised, I believe. Hence the whole execution of frequencies is O(n log n).
counts is a locally bound variable, so accessing it twice is no problem.
Here is a functional version of frequencies that doesn't use any mutable state:
(defn frequencies-2 [coll]
  (reduce (fn [m v] (assoc m v (inc (get m v 0)))) {} coll))
This functional version is also O(n log n), though it will have somewhat more overhead (a higher constant factor) due to creating and discarding more temporary objects.
You could use a tree to store the map from elements to frequencies with log(n) complexity (it can be a binary search tree, an AVL, a red-black tree, etc.).
Choose a functional implementation of this tree, i.e. you can't mutate it, but instead assoc counts x freq returns a new data structure that shares the common parts with counts in memory. It's a kind of "copy on write".
Then the performance of computing all frequencies would be O(n log(n)).
assoc! mutates a transient data structure, and it has much better performance than assoc. It is not really a violation of Clojure's immutability model (see http://clojure.org/transients).
persistent! and transient are O(1)
assoc! is O(log32 n), which is effectively O(1): a hash-map has an upper bound on its size of ~2^32 items, which leaves a maximum tree depth of about 6 or 7.
Therefore, the complexity of frequencies is linear on the size of coll.
Remark: As noted by @mikera, the complexity of frequencies would also be linear with assoc, but with a higher constant factor.
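A small sketch of why the transient round trip does not break immutability from the caller's point of view: the original map is left untouched, and the mutation stays confined to the transient. The names m and m2 are chosen for illustration.

```clojure
(def m {:a 1})

;; transient and persistent! are the O(1) conversions mentioned above;
;; assoc! mutates only the transient in between.
(def m2 (persistent! (assoc! (transient m) :b 2)))

m  ;=> {:a 1}        the original map is unchanged
m2 ;=> {:a 1, :b 2}
```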
In clojure lists grow from the left and vectors grow from the right, so:
user> (conj '(1 2 3) 4)
(4 1 2 3)
user> (conj [1 2 3] 4)
[1 2 3 4]
What's the most efficient method of inserting values both into the front and the back of a sequence?
You need a different data structure to support fast inserting at both start and end. See https://github.com/clojure/data.finger-tree
As I understand it, a sequence is just a generic data structure so it depends on the specific implementation you are working with.
For a data structure that supports random access (e.g. a vector), appending at the end should take (effectively) constant time, O(1); inserting at the front means rebuilding the vector, which is O(n).
For a list, I would expect inserting at the front of the list with a cons operation to take constant time, but inserting to the back of the list will take O(n) since you have to traverse the entire structure to get to the end.
There are, of course, a lot of other data structures that can theoretically be a sequence (e.g. trees) that will have their own complexity characteristics.
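The trade-offs above can be sketched with core functions only. The comments state the usual costs of these operations on the built-in vector and list types; v is a name chosen for illustration.

```clojure
(def v [1 2 3])

;; Back of a vector: effectively constant time.
(conj v 4)            ;=> [1 2 3 4]

;; Front of a vector: builds a new vector, linear in its size.
(into [0] v)          ;=> [0 1 2 3]

;; cons is constant time, but the result is a seq, not a vector.
(cons 0 v)            ;=> (0 1 2 3)

;; Front of a list: constant time.
(conj '(1 2 3) 0)     ;=> (0 1 2 3)

;; Back of a list: concat is lazy, but reaching the new last element
;; still means walking the whole list.
(concat '(1 2 3) [4]) ;=> (1 2 3 4)
```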