IndexedSeq VS. PersistentVector - clojure

Can somebody explain me, the difference between 'IndexedSeq' and 'PersistentVector'?
I bumped into this, when updating a vector in my data structure via 'rest'. Here's a REPL excerpt that shows the transformation.
=> (def xs [1 2 3])
...
(type xs)
cljs.core/PersistentVector
=> (def xs2 (rest xs))
...
(type xs2)
cljs.core/IndexedSeq
I'm holding a list in an app-state atom, which needs to be shifted once in a while, so the first item must disappear. Would be really cool, if anybody could give me a hint about which data structure might be preferable here in terms of performance.
Sometimes elements get pushed to the end of the list as well, so I guess it's a LIFO mechanism that I'm creating here.

From your last paragraph, it sounds like you're using this as a stack. Taken together, pop, peek, and conj form a stack interface that can be used with either lists or vectors (working on the front of a list or the end of a vector). I would use those.
If you're just using those functions, I don't think there should be any significant performance differences (all three functions should be constant time).

looking at the superinterfaces here: http://static.javadoc.io/org.clojure/clojure/1.7.0/clojure/lang/IndexedSeq.html
I can guess, it is not the most efficient thing here, since it is just a seq, with no guaranteed constant-time access to the nth member. To ensure the vector semantics you should probably use subvec to remove the first element.
In general, if you don't do random access to elements, in terms of performance it should be enough to use concat to add element to the end (as it produces a lazy sequence, won't consume the whole collection, and should be done in a constant time) and rest to remove the first element (as it is also done in a constant time), to make FIFO stack (which is what you do). (it's not the best variant still, since it may lead to stack owerflow, if you do alot of push without realizing the sequence.
But sure it's better to use vectors. So the combination of conj , first, and subvec should be your choice.

Related

Non-map collection predicate?

Is there a Clojure predicate that means "collection, but not a map"?
Such a predicate is/would be valuable because there are many operations that can be performed on all collections except maps. For example (apply + ...) or (reduce + ...) can be used with vectors, lists, lazy sequences, and sets, but not maps, since the elements of a map in such a context end up as clojure.lang.MapEntrys. It's sets and maps that cause the problem with those predicates that I know of:
sequential? is true for vectors, lists, and lazy sequences, but it's false for both maps and sets. (seq? is similar but it's false for vectors.)
coll? and seqable? are true for both sets and maps, as well as for every other kind of collection I can think of.
Of course I can define such a predicate, e.g. like this:
(defn coll-but-not-map?
[xs]
(and (coll? xs)
(not (map? xs))))
or like this:
(defn sequential-or-set?
[xs]
(or (sequential? xs)
(set? xs)))
I'm wondering whether there's a built-in clojure.core (or contributed library) predicate that does the same thing.
This question is related to this one and this one but isn't answered by their answers. (If my question is a duplicate of one I haven't found, I'm happy to have it marked as such.)
For example (apply + ...) or (reduce + ...) can be used with vectors, lists, lazy sequences, and sets, but not maps
This is nothing about collections, I think. In your case, you have a problem not with general apply or reduce application, but with particular + function. (apply + [:a :b :c]) won't work either even though we are using a vector here.
My point is that you are trying to solve very domain specific problem, that's why there is no generic solution in Clojure itself. So use any proper predicate you can think of.
There's nothing that I've found or used that fits this description. I think your own predicate function is clear, simple, and easy to include in your code if you find it useful.
Maybe you are writing code that has to be very generic, but it's usually the case that a function both accepts and returns a consistent type of data. There are cases where this is not true, but it's usually the case that if a function can be the jack of all trades, it's doing too much.
Using your example -- it makes sense to add a vector of numbers, a list of numbers, or a set of numbers. But a map of numbers? It doesn't make sense, unless maybe it's the values contained in the map, and in this case, it's not reasonable for a single piece of code to be expected to handle adding both sequential data and associative data. The function should be handed something it expects, and it should return something consistent. This kind of reminds me of Stuart Sierra's blog post discussing consistency in this regard. Without more information I'm only guessing as to your use case, but it's something to consider.

Why is cons necessary to prevent infinite recursion

When defining an infinite sequence, I noticed that cons is necessary to avoid infinite recursion. However, what I don't understand is why. Here is the code in question:
(defn even-numbers
([] (even-numbers 0))
([n] (cons n (lazy-seq (even-numbers (+ 2 n))))))
(take 10 (even-numbers))
;; (0 2 4 6 8 10 12 14 16 18)
This works great; but since I love to question things, I began to wonder why the cons was needed (other than to include 0). After all, the lazy-seq function creates a lazy-seq. Which means, the rest of the values should not be calculated until called (or chunked). So, I tried it.
(defn even-numbers-v2
([] (even-numbers-v2 0))
([n] (lazy-seq (even-numbers-v2 (+ 2 n)))))
(take 10 (even-numbers-v2))
;; Infinite loooooooooop
So, now I know that cons is necessary, but I'd like to know why cons is necessary to cause lazy evaluation of a supposedly lazy sequence
Lazy seqs are a way to defer computation of actual seq elements, but those elements do need to be computed eventually. That doesn't actually have to involve cons – for example clojure.core/concat uses "chunked conses" when processing chunked operands, and it's ok to wrap any concrete seq type whatsoever in lazy-seq – but some kind of non-lazy return after however many layers of lazy-seq is necessary if any seq processing is to take place. Otherwise there won't even be a first element to get to.
Put yourself in the position of a function that's been handed a lazy seq. The caller has told it, in effect, "here's this thing that's for all intents and purposes a seq, but I don't feel like computing the actual elements until later". Now our function needs some actual elements to operate, so it pokes and prods the seq to try and get it to produce some elements… and then what?
If peeling off some lazy-seq layers eventually produces a Cons cell, a list, a seq over a vector or any other concrete seq-like thing with actual elements, then great, the function can read off an element from that and make progress.
But if the only result of peeling off those layers is that more layers are revealed, and it's lazy-seqs all the way down, well… There are no elements to be found. And since in principle there is no way to determine whether by peeling off sufficiently many layers some elements could eventually be produced (cf. the halting problem), the function consuming an unrealizable lazy seq of this sort has in general no choice but to continue looping forever.
To take another angle, let's consider your even-numbers-v2 function. It takes an argument and returns a lazy-seq object wrapping a further call to itself. Now, the original argument it receives (n) is used to compute the argument to the recursive call ((+ 2 n)), but otherwise isn't placed in any data structure or otherwise conveyed to the caller, so there is no reason why it would occur as an element of the resulting seq. All the caller sees is that the function has produced a lazy seq object and it has no choice but to unwrap that in search for an actual element of the sequence; and of course then the situation repeats itself (not strictly forever in this case, but only because + will eventually complain about arithmetic overflow when dealing with longs).

Why clojure collections don't implement ISeq interface directly?

Every collection in clojure is said to be "sequable" but only list and cons are actually seqs:
user> (seq? {:a 1 :b 2})
false
user> (seq? [1 2 3])
false
All other seq functions first convert a collection to a sequence and only then operate on it.
user> (class (rest {:a 1 :b 2}))
clojure.lang.PersistentArrayMap$Seq
I cannot do things like:
user> (:b (rest {:a 1 :b 2}))
nil
user> (:b (filter #(-> % val (= 1)) {:a 1 :b 1 :c 2}))
nil
and have to coerce back to concrete data type. This looks like bad design to me, but most likely I just don't get it as yet.
So, why clojure collections don't implement ISeq interface directly and all seq functions don't return an object of the same class as the input object?
This has been discussed on the Clojure google group; see for example the thread map semantics from February of this year. I'll take the liberty of reusing some of the points I made in my message to that thread below while adding several new ones.
Before I go on to explain why I think the "separate seq" design is the correct one, I would like to point out that a natural solution for the situations where you'd really want to have an output similar to the input without being explicit about it exists in the form of the function fmap from the contrib library algo.generic. (I don't think it's a good idea to use it by default, however, for the same reasons for which the core library design is a good one.)
Overview
The key observation, I believe, is that the sequence operations like map, filter etc. conceptually divide into three separate concerns:
some way of iterating over their input;
applying a function to each element of the input;
producing an output.
Clearly 2. is unproblematic if we can deal with 1. and 3. So let's have a look at those.
Iteration
For 1., consider that the simplest and most performant way to iterate over a collection typically does not involve allocating intermediate results of the same abstract type as the collection. Mapping a function over a chunked seq over a vector is likely to be much more performant than mapping a function over a seq producing "view vectors" (using subvec) for each call to next; the latter, however, is the best we can do performance-wise for next on Clojure-style vectors (even in the presence of RRB trees, which are great when we need a proper subvector / vector slice operation to implement an interesting algorithm, but make traversals terrifying slow if we used them to implement next).
In Clojure, specialized seq types maintain traversal state and extra functionality such as (1) a node stack for sorted maps and sets (apart from better performance, this has better big-O complexity than traversals using dissoc / disj!), (2) current index + logic for wrapping leaf arrays in chunks for vectors, (3) a traversal "continuation" for hash maps. Traversing a collection through an object like this is simply faster than any attempt at traversing through subvec / dissoc / disj could be.
Suppose, however, that we're willing to accept the performance hit when mapping a function over a vector. Well, let's try filtering now:
(->> some-vector (map f) (filter p?))
There's a problem here -- there's no good way to remove elements from a vector. (Again, RRB trees could help in theory, but in practice all the RRB slicing and concatenating involved in producing "real vector" for filtering operations would absolutely destroy performance.)
Here's a similar problem. Consider this pipeline:
(->> some-sorted-set (filter p?) (map f) (take n))
Here we benefit from laziness (or rather, from the ability to stop filtering and mapping early; there's a point involving reducers to be made here, see below). Clearly take could be reordered with map, but not with filter.
The point is that if it's ok for filter to convert to seq implicitly, then it is also ok for map; and similar arguments can be made for other sequence functions. Once we've made the argument for all -- or nearly all -- of them, it becomes clear that it also makes sense for seq to return specialized seq objects.
Incidentally, filtering or mapping a function over a collection without producing a similar collection as a result is very useful. For example, often we care only about the result of reducing the sequence produced by a pipeline of transformations to some value or about calling a function for side effect at each element. For these scenarios, there is nothing whatsoever to be gained by maintaining the input type and quite a lot to be lost in performance.
Producing an output
As noted above, we do not always want to produce an output of the same type as the input. When we do, however, often the best way to do so is to do the equivalent of pouring a seq over the input into an empty output collection.
In fact, there is absolutely no way to do better for maps and sets. The fundamental reason is that for sets of cardinality greater than 1 there is no way to predict the cardinality of the output of mapping a function over a set, since the function can "glue together" (produce the same outputs for) arbitrary inputs.
Additionally, for sorted maps and sets there is no guarantee that the input set's comparator will be able to deal with outputs from an arbitrary function.
So, if in many cases there is no way to, say, map significantly better than by doing a seq and an into separately, and considering how both seq and into make useful primitives in their own right, Clojure makes the choice of exposing the useful primitives and letting users compose them. This lets us map and into to produce a set from a set, while leaving us the freedom to not go on to the into stage when there is no value to be gained by producing a set (or another collection type, as the case may be).
Not all is seq; or, consider reducers
Some of the problems with using the collection types themselves when mapping, filtering etc. don't apply when using reducers.
The key difference between reducers and seqs is that the intermediate objects produced by clojure.core.reducers/map and friends only produce "descriptor" objects that maintain information on what computations need to be performed in the event that the reducer is actually reduced. Thus, individual stages of the computation can be merged.
This allows us to do things like
(require '[clojure.core.reducers :as r])
(->> some-set (r/map f) (r/filter p?) (into #{}))
Of course we still need to be explicit about our (into #{}), but this is just a way of saying "the reducers pipeline ends here; please produce the result in the form of a set". We could also ask for a different collection type (a vector of results perhaps; note that mapping f over a set may well produce duplicate results and we may in some situations wish to preserve them) or a scalar value ((reduce + 0)).
Summary
The main points are these:
the fastest way to iterate over a collection typically doesn't involve produce intermediate results similar to the input;
seq uses the fastest way to iterate;
the best approach to transforming a set by mapping or filtering involves using a seq-style operation, because we want to iterate very fast while accumulating an output;
thus seq makes a great primitive;
map and filter, in their choice to deal with seqs, depending on the scenario, may avoid performance penalties without upsides, benefit from laziness etc., yet can still be used to produce a collection result with into;
thus they too make great primitives.
Some of these points may not apply to a statically typed language, but of course Clojure is dynamic. Additionally, when we do want to a return that matches input type, we're simply forced to be explicit about it and that, in itself, may be viewed as a good thing.
Sequences are a logical list abstraction. They provide access to a (stable) ordered sequence of values. They are implemented as views over collections (except for lists where the concrete interface matches the logical interface). The sequence (view) is a separate data structure that refers into the collection to provide the logical abstraction.
Sequence functions (map, filter, etc) take a "seqable" thing (something which can produce a sequence), call seq on it to produce the sequence, and then operate on that sequence, returning a new sequence. It is up to you whether you need to or how to re-collect that sequence back into a concrete collection. While vectors and lists are ordered, sets and maps are not and thus sequences over these data structures must compute and retain the order outside the collection.
Specialized functions like mapv, filterv, reduce-kv allow you to stay "in the collection" when you know you want the operation to return a collection at the end instead of sequence.
Seqs are ordered structures, whereas maps and sets are unordered. Two maps that are equal in value may have a different internal ordering. For example:
user=> (seq (array-map :a 1 :b 2))
([:a 1] [:b 2])
user=> (seq (array-map :b 2 :a 1))
([:b 2] [:a 1])
It makes no sense to ask for the rest of a map, because it's not a sequential structure. The same goes for a set.
So what about vectors? They're sequentially ordered, so we could potentially map across a vector, and indeed there is such a function: mapv.
You may well ask: why is this not implicit? If I pass a vector to map, why doesn't it return a vector?
Well, first that would mean making an exception for ordered structures like vectors, and Clojure isn't big on making exceptions.
But more importantly you'd lose one of the most useful properties of seqs: laziness. Chaining together seq functions, such as map and filter is a very common operation, and without laziness this would be much less performant and far more memory-intensive.
The collection classes follow a factory pattern i.e instead of implementing ISeq they implement Sequable i.e you can create a ISeq from the collection but the collection itself is not an ISeq.
Now even if these collections implemented ISeq directly I am not sure how that would solve your problem of having general purpose sequence functions that would return the original object, as that would not make sense at all as these general purpose functions are supposed to work on ISeq, they have no idea about which object gave them this ISeq
Example in java:
interface ISeq {
....
}
class A implements ISeq {
}
class B implements ISeq {
}
static class Helpers {
/*
Filter can only work with ISeq, that's what makes it general purpose.
There is no way it could return A or B objects.
*/
public static ISeq filter(ISeq coll, ...) { }
...
}

For what does clojure implement implicit conversion between, e.g. a vector to a list?

If I do
user => (next [1 2 3])
I get
(2 3)
It seems that an implicit conversion between vector and list is being operated.
Conceptually, applying next on a vector does not make a lot of sense because a vector is not a sequence. Indeed Clojure does not implement next for a vector. When I apply next on a vector, Clojure kindly suggests that "You wanted to say (next seq), right?".
Isn't it more straight forward to say that a vector does not have next method? What can be reasons why this implicit conversion is more advantageous and/or necessary?
If you look at the docs, next says:
Returns a seq of the items after the first. Calls seq on its argument.
If there are no more items, returns nil.
meaning that this method calls seq on the collection you give it (in your case, its a vector), and it returns a seq containing the rest.
In clojure, lots of things are "colls", such as sequences, vectors, sets and even maps, so for example, this would also work:
(next {:a 1 :b 2}) ; returns ([:b 2])
so the behavior is consistent - transform any collection of items into a seq. This is very common in clojure, map and partition for example do the same thing:
(map inc [1 2 3]) ; returns (2 3 4)
(partition 2 [1 2 3 4]) ; returns ((1 2)(3 4))
this is useful for two main reasons (more are welcome!):
it allows these core functions to operate on any data type you throw at them, as long as it is a "collection"
it allows for lazy computation, eg. even if try to map a large vector but you only asked for the first few items, map wont have to actually pre-compute all items.
Clojure has the concept of a sequence (which just happens to display the same as a list.
next is a function that makes sense on any collection that is a sequence (or can reasonably be coerced into one).
(type '(1 2 3))
=> clojure.lang.PersistentList
(type (rest [1 2 3]))
=>clojure.lang.PersistentVector$ChunkedSeq
There are tradeoffs in the design of any language or library. Allowing the same operation to work on different collection types makes it easier to write many programs. You often don't have to worry about differences between lists and vectors if you don't want to worry about them. If you decide you want to use one sequence type rather than another, you might be able to leave all of the rest of the code as it was. This is all implicit in Shlomi's answer, which also points out an advantage involving laziness.
There are disadvantages to Clojure's strategy, too. Clojure's flexible operations on collections mean that Clojure might not tell you that you have mistakenly used a collection type that you didn't intend. Other languages lack Clojure's flexibility, but might help you catch certain kinds of bugs more quickly. Some statically typed languages, such as Standard ML, for example, take this to an extreme--which is a good thing for certain purposes, but bad for others.
Clojure lets you control performance / abstractions operating a choice between list and vector.
List
is fast on operations at the beginning of the sequence like cons / conj
is fast on iteration with first / rest
Vector
is fast on operations at the end of the sequence like pop / peek
participates in associative abstraction with indexes as keys
is fast on subvec
Both participate in sequence abstraction. Clojure functions and conversions they operate, are made to ease idiomatic code writing.

How Are Lazy Sequences Implemented in Clojure?

I like Clojure. One thing that bothers me about the language is that I don't know how lazy sequences are implemented, or how they work.
I know that lazy sequences only evaluate the items in the sequence that are asked for. How does it do this?
What makes lazy sequences so efficient that they don't consume much
stack?
How come you can wrap recursive calls in a lazy sequence and no
longer get a stack over flow for large computations?
What resources do lazy sequences consume to do what it does?
In what scenarios are lazy sequences inefficient?
In what scenarios are lazy sequences most efficient?
Let's do this.
• I know that lazy sequences only evaluate the items in the sequence that are asked for, how does it do this?
Lazy sequences (henceforth LS, because I am a LP, or Lazy Person) are composed of parts. The head, or the part(s, as really 32 elements are evaluated at a time, as of Clojure 1.1, and I think 1.2) of the sequence that have been evaluated, is followed by something called a thunk, which is basically a chunk of information (think of it as the rest of the your function that creates the sequence, unevaluated) waiting to be called. When it is called, the thunk evaluates however much is asked of it, and a new thunk is created, with context as necessary (how much has been called already, so it can resume from where it was before).
So you (take 10 (whole-numbers)) – assume whole-numbers is a lazy sequence of whole numbers. That means you're forcing evaluation of thunks 10 times (though internally this may be a little difference depending on optimizations.
• What makes lazy sequences so efficient that they don't consume much stack?
This becomes clearer once you read the previous answer (I hope): unless you call for something in particular, nothing is evaluated. When you call for something, each element of the sequence can be evaluated individually, then discarded.
If the sequence is not lazy, oftentimes it is holding onto its head, which consumes heap space. If it is lazy, it is computed, then discarded, as it is not required for subsequent computations.
• How come you can wrap recursive calls in a lazy sequence and no longer get a stack over flow for large computations?
See the previous answer and consider: the lazy-seq macro (from the documentation) will
will invoke the body only the first time seq
is called, and will cache the result and return it on all subsequent
seq calls.
Check out the filter function for a cool LS that uses recursion:
(defn filter
"Returns a lazy sequence of the items in coll for which
(pred item) returns true. pred must be free of side-effects."
[pred coll]
(let [step (fn [p c]
(when-let [s (seq c)]
(if (p (first s))
(cons (first s) (filter p (rest s)))
(recur p (rest s)))))]
(lazy-seq (step pred coll))))
• What resources do lazy sequences consume to do what it does?
I'm not quite sure what you're asking here. LSs require memory and CPU cycles. They just don't keep banging the stack, and filling it up with results of the computations required to get the sequence elements.
• In what scenarios are lazy sequences inefficient?
When you're using small seqs that are fast to compute and won't be used much, making it an LS is inefficient because it requires another couple chars to create.
In all seriousness, unless you're trying to make something extremely performant, LSs are the way to go.
• In what scenarios are lazy sequences most efficient?
When you're dealing with seqs that are huge and you're only using bits and pieces of them, that is when you get the most benefit from using them.
Really, it's pretty much always better to use LSs over non-LSs, in terms of convenience, ease of understanding (once you get the hang of them) and reasoning about your code, and speed.
I know that lazy sequences only evaluate the items in the sequence that are asked for, how does it do this?
I think the previously posted answers already do a good job explaining this part. I'll only add that the "forcing" of a lazy sequence is an implicit -- paren-free! :-) -- function call; perhaps this way of thinking about it will make some things clearer. Also note that forcing a lazy sequence involves a hidden mutation -- the thunk being forced needs to produce a value, store it in a cache (mutation!) and throw away its executable code, which will not be required again (mutation again!).
I know that lazy sequences only evaluate the items in the sequence that are asked for, how does it do this?
What makes lazy sequences so efficient that they don't consume much stack?
What resources do lazy sequences consume to do what it does?
They don't consume stack, because they consume heap instead. A lazy sequence is a data structure, living on the heap, which contains a small bit of executable code which can be called to produce more of the data structure if/when that is required.
How come you can wrap recursive calls in a lazy sequence and no longer get a stack over flow for large computations?
Firstly, as mentioned by dbyrne, you can very well get an SO when working with lazy sequences if the thunks themselves need to execute code with a very deeply nested call structure.
However, in a certain sense you can use lazy seqs in place of tail recursion, and to the degree that this works for you you can say that they help in avoiding SOs. In fact, rather importantly, functions producing lazy sequences should not be tail recursive; the conservation of stack space with lazy seq producers arises from the aforementioned stack -> heap transfer and any attempts to write them in a tail recursive fashion will only break things.
The key insight is that a lazy sequence is an object which, when first created, doesn't hold any items (as a strict sequence always does); when a function returns a lazy sequence, only this "lazy sequence object" is returned to the caller, before any forcing takes place. Thus the stack frame used up by the call which returned the lazy sequence is popped before any forcing takes place. Let's have a look at an example producer function:
(defn foo-producer [] ; not tail recursive...
(lazy-seq
(cons :foo ; because it returns the value of the cons call...
(foo-producer)))) ; which wraps a non-tail self-call
This works because lazy-seq returns immediately, thus (cons :foo (foo-producer)) also returns immediately and the stack frame used up by the outer call to foo-producer is immediately popped. The inner call to foo-producer is hidden in the rest part of the sequence, which is a thunk; if/when that thunk is forced, it will briefly use up its own frame on the stack, but then return immediately as described above etc.
Chunking (mentioned by dbyrne) changes this picture very slightly, because a larger number of elements gets produced at each step, but the principle remains the same: each step used up some stack when the corresponding elements of the lazy seq are being produced, then that stack is reclaimed before more forcing takes place.
In what scenarios are lazy sequences inefficient?
In what scenarios are lazy sequences most efficient?
There's no point to being lazy if you need to hold the entire thing at once anyway. A lazy sequence makes a heap allocation at every step when not chunked or at every chunk -- once every 32 steps -- when chunked; avoiding that can net you a performance gain in some situations.
However, lazy sequences enable a pipelined mode of data processing:
(->> (lazy-seq-producer) ; possibly (->> (range)
(a-lazy-seq-transformer-function) ; (filter even?)
(another-transformer-function)) ; (map inc))
Doing this in a strict way would allocate plenty of heap anyway, because you'd have to keep the intermediate results around to pass them to the next processing stage. Moreover, you'd need to keep the whole thing around, which is actually impossible in the case of (range) -- an infinite sequence! -- and when it is possible, it is usually inefficient.
Originally, lazy sequences in Clojure were evaluated item-by-item as they were needed. Chunked sequences were added in Clojure 1.1 to improve performance. Instead of item-by-item evaluation, "chunks" of 32 elements are evaluated at a time. This reduces the overhead that lazy evaluation incurs. Also, it allows clojure to take advantage of the underlying data structures. For example, PersistentVector is implemented as a tree of 32 element arrays. This means that to access an element, you must traverse the tree until the appropriate array is found. With chunked sequences, entire arrays are grabbed at a time. This means each of the 32 elements can be retrieved before the tree needs to be re-traversed.
There has been discussion about providing a way to force item-by-item evaluation in situations where full laziness is required. However, I don't think it has been added to the language yet.
How come you can wrap recursive calls in a lazy sequence and no longer get a stack over flow for large computations?
Do you have an example of what you mean? If you have a recursive binding to a lazy-seq, it can definitely cause a stack overflow.