What is the difference between Clojure's APersistentMap implementations?

I'm trying to figure out what the difference is between a PersistentHashMap, PersistentArrayMap, PersistentTreeMap, and PersistentStructMap.
Also, if I use {:a 1} it gives me a PersistentArrayMap, but can this change to any of the other ones if I give it objects or things other than keys?

The four implementations you list fall into three groups:
"literal": PersistentArrayMap and PersistentHashMap: basic map types used when dealing with map literals (though constructor functions are also available with different behaviour around handling duplicate keys -- in Clojure 1.5.x literals throw exceptions when they discover duplicate keys, constructor functions work like left-to-right repeated conjing; this behaviour has been evolving from version to version). Array maps get promoted to hash maps when growing beyond a certain number of entries (9 IIRC). Array maps exist because they are faster for small maps; they also differ from hash maps in that they keep entries in insertion order prior to promotion to hash map (you can use clojure.core/array-map to get arbitrarily large array maps, which may be useful if you really know you'd benefit from insertion-order traversals and the map won't be too large, perhaps just a bit over the usual threshold; NB. a subsequent assoc to such an oversized array map will return a hash map). Array maps use arrays with keys and values interleaved; the PHM uses a persistent version of Phil Bagwell's hash array mapped trie with separate chaining for hash collisions and separate node types for mostly-empty and at-least-half-full nodes and is easily the most complex data structure in Clojure.
2. sorted: PersistentTreeMap instances are created by special request only (a call to sorted-map or sorted-map-by). They are implemented as red-black trees and maintain entries in a particular order, as specified by the default compare comparator if created with sorted-map, or by a user-supplied comparator if created with sorted-map-by.
3. special-purpose, probably deprecated: PersistentStructMap is not used very often and is mostly viewed as deprecated in favour of records, although I actually can't remember right now if there ever was an official deprecation notice. The original purpose was to provide maps with particularly fast access to certain often-used keys. This can now be accomplished with records when using keywords for field access (with the keyword in operator position: (:foo instance-of-some-record-with-field-foo)), though it's important to note that records are not = to regular maps with the same entries.
All four of these built-in map types fall into the same "equality partition"; that is, any two maps of the four classes mentioned above will be equal if (and only if) they contain the same keys (as determined by Clojure's =) with the same corresponding values. Records, as mentioned in 3. above, are map-like, but each record type forms its own equality partition.
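A quick REPL sketch of the equality partition (the record name Foo is made up for illustration):
(= {:a 1} (hash-map :a 1) (sorted-map :a 1) (array-map :a 1))
=> true   ;; different classes, same equality partition
(defrecord Foo [a])
(= (->Foo 1) {:a 1})
=> false  ;; records form their own partition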

They are different implementations of a persistent map (they all extend APersistentMap). A PersistentArrayMap uses an array as the underlying data structure to implement a persistent map, and similarly the other implementations use different underlying data structures.
The reason for the different implementations is that each provides different benefits in different situations (the efficiency of the implementation depends on the underlying data structure).
From a developer's perspective this is abstracted away; you should not use the concrete classes directly, but instead work with the APersistentMap abstract class or the IPersistentMap interface (in case some type checking etc. is required for a specific case).
Depending on the number of entries in the map, a different implementation is used.
(type (into {} (map #(-> [% %]) (range 5))))
=> PersistentArrayMap
(type (into {} (map #(-> [% %]) (range 10))))
=> PersistentHashMap

Related

Clojure sequences and collections

In Lisp, all data structures are built of cons cells, i.e. they are essentially linked lists or binary trees or both (correct me if I'm wrong). Clojure's data structures are lists, vectors, maps and sets. Clojure provides two inclusive abstractions over these data structures: collections and sequences. The sequence abstraction defines the first, rest and cons operations, whereas the collection abstraction defines collection-specific operations such as conj and into.
Clojure core functions such as map and filter operate on the sequence abstraction but accept any data structure and perform the conversion implicitly. These functions are also lazy. Does this mean that by default Clojure internally stores data in more efficient data structures such as indexed arrays and only switches to linked lists as needed? How does Clojure actually convert collections to sequences? Is the sequence built from the collection using an iterator in a streaming fashion, or built as a whole and then passed to the consumer?
The only data structure in Clojure that is a singly-linked list is an actual list like:
(list 1 2 3)
Everything else is an efficient data structure (e.g. vector, map).
A lazy-sequence is (nominally) composed of the current value and a recipe for generating the next value. Once computed, elements are cached and are not re-computed.
Conversion of a collection to a sequence is an implementation detail and is not normally important to the end user.
The original map and filter functions are lazy, as are many others. However, the unpredictable time of realization was enough of a headache that eager versions, mapv and filterv, were added to the language.
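A quick sketch of the difference at the REPL:
(class (map inc [1 2 3]))
=> clojure.lang.LazySeq            ;; nothing computed until consumed
(class (mapv inc [1 2 3]))
=> clojure.lang.PersistentVector   ;; eager, stays a vector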

Can I overload the count function in Clojure

In Python if I want to customize the way to define how to find the size of an object I define a __len__ method for the class, which alters the behavior of the len function. Is there any way to do this in Clojure with the count function? If so, how?
This is a reasonable question to ask when you are moving from one paradigm to
another i.e. OO to functional, but likely is not an issue. In languages like
Clojure, you will normally start by working out how to represent your data in
one of the existing data structures (Collections) rather than define a new
structure to represent your data. The idea is to have only a few different data
structures with a large number of well defined and understood functions which
operate on those structures in an efficient and reliable manner. As a
consequence, once you have decided on how to represent your graphs, you will
likely find that doing something like counting the number of vertices is
trivial - you will probably just need to use reduce with a very simple function
that in turn uses other core functions, such as the existing count. A key to
becoming proficient in Clojure is in learning what the core functions are and
what they do.
In an OO language, you encapsulate your data and hide it inside an object. This
means you now need to define the methods to operate on the data inside that
object. To support polymorphism, you will frequently do things like add an
object specific size or length method. In Clojure, the underlying data structure
is not hidden and is frequently built on one of the standard collection types,
so you don't need to write a size/length function; you can use the standard
count function. When the standard collections are not suitable and you need
more, then you tend to look at things like protocols, where you can define your
own specialised functions e.g. size.
In your case, a protocol or record is unlikely to be necessary - representing
graphs is pretty much a natural fit for the standard collections and I wouldn't
be at all surprised if you could re-implement what you did in C or C++ with
Clojure in a lot fewer lines and likely in a much more declarative and cleaner
manner. Start by looking at how the standard Clojure collection types could be
used to represent your graphs. Think about how you want to operate on the graphs
and whether you are best representing the graph as nodes or vertices, and then
look at how you would answer questions like "How many vertices are in this
graph?" and see how you would get that answer just using the available built-in
functions.
You do need to think about things differently when you move to a functional
paradigm. There will be a point you get to that is a real "Aha" moment as that
penny drops. Once it does, you will likely be very surprised how nice it is, but
until that moment, you will likely experience a bit of frustration and hair
pulling. The battle is worth it as you will likely find even your old imperative
programming skills benefit and when you have to go back to C/C++ or Python, your
code is even clearer and more concise. Try to avoid the temptation to reproduce
what you did in C/Python in Clojure; instead, focus on the outcome you want to
achieve and see how the supplied facilities of the language will help you do
that.
Your comment says you are dealing with graphs. Taking on board the good advice to use the standard data structures, let's consider how to represent them.
You would normally represent a graph as a map Node -> Collection of Node. For example,
(def graph {:a [:b], :b [:a :c]})
Then
(count graph)
=> 2
However, if you make sure that every node has a map entry, even the ones that have no efferent (outgoing) arcs, then all you have to do is count the graph's map. A function to add the empty entries is:
(defn add-empties [gm]
  (if (empty? gm)
    gm
    (let [EMPTY   (-> gm first val empty)
          missing (->> gm
                       vals
                       (apply concat)
                       (remove gm))]
      (reduce (fn [acc x] (assoc acc x EMPTY)) gm missing))))
For example,
(add-empties graph)
=> {:a [:b], :b [:a :c], :c []}
and
(count (add-empties graph))
=> 3
What does count mean for a graph?
What should count return for a graph? I can think of two equally obvious options -- the number of nodes in the graph or the number of edges. So perhaps count isn't very intuitive for a graph.
Implementing Counted
Nevertheless, you certainly can define your own counted data structures in Clojure. The key is to implement the clojure.lang.Counted interface.
Now, if we represent a graph via the following deftype:
(deftype NodeToEdgeGraph [node->neighbors]
  ...)
we can make it counted:
(deftype NodeToEdgeGraph [node->neighbors]
  clojure.lang.Counted
  (count [this]
    (count node->neighbors))
  ...)
This is if we are representing a graph as a map that maps each node to its set of "neighbors" (where a node is considered a "neighbor" if, and only if, there is an edge between the two), and we want count to return the number of nodes.
Alternatively, we can represent a graph as a set of pairs (either ordered, in the case of a directed graph; or unordered, in the case of an undirected graph). In this case, we have:
(deftype EdgeGraph [edges]
  ...)
And we can have count return the number of edges in the graph:
(deftype EdgeGraph [edges]
  clojure.lang.Counted
  (count [this]
    (count edges))
  ...)
So far, we have been using count on the underlying structure to implement count for the graph. This works because the underlying structure conveniently has the same count as the way we are counting each respective graph representation. However, there's no reason we couldn't use either representation with either way of counting. Also, there are other ways of representing graphs that might not align so nicely with the way we want to count. In these cases, we can always maintain our own internal count:
(deftype ???Graph [cnt ???]
  clojure.lang.Counted
  (count [this]
    cnt)
  ...)
Here, we rely on the implementation of the graph operations to maintain cnt appropriately.
Built-in or Custom Types?
Others have suggested using only the built-in data structures to represent your graphs. Indeed, if you take a look at NodeToEdgeGraph and EdgeGraph, you can see that count simply delegates to the underlying structure. So for these representations (and definitions of count), you could simply eliminate the deftype and implement your graph operations directly over maps / sets. However, let's take a look at some advantages to encapsulating graphs in a custom type:
Ease of implementing alternative representations. With custom types, you can have an unlimited number of graph representations. All you need to do is define some protocols and define graph operations in terms of those protocols. While you can also extend protocols to built-in types, you are limited to only one representation / implementation per built-in type. This might or might not be sufficient for you.
Maintenance of internal invariants. With custom types, you have better control over the internal structure. That is, you can implement protocols / interfaces in a way that maintains any necessary internal invariants. For example, if you are implementing undirected graphs, you can implement NodeToEdgeGraph in a way that ensures adding an edge from node A to node B also adds an edge from node B to node A. Likewise with removing an edge. With built-in types, these invariants can only be maintained by the graph functions you implement. The Clojure core functions will not maintain these invariants for you, because they know nothing about them. So if you hand your "graphs" (over built-in types) off to some function that calls non-graph functions on them, you have no assurance that you will get a valid graph back. On the other hand, custom types only allow the operations you implement, and will perform them the way you implement them. As long as you take care that all the operations you implement maintain the proper invariants, you can rest assured that these invariants will always be maintained (see the sketch below).
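As a minimal sketch (the name add-edge and the node -> set-of-neighbors representation are illustrative, not from the original), this is the kind of invariant-preserving operation such an undirected-graph type would implement:
(defn add-edge
  "Add an undirected edge, keeping the neighbor map symmetric."
  [g a b]
  (-> g
      (update a (fnil conj #{}) b)
      (update b (fnil conj #{}) a)))
(add-edge {} :a :b)
=> {:a #{:b}, :b #{:a}}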
On the other hand, sometimes it is appropriate to simply use the built-in types. For instance, if your application only makes light use of graph operations, it might be more convenient to simply use built-in types. On the other hand, if your application is heavily graph-oriented and makes a lot of graph operations, it's probably better in the long run to implement a custom type.
Of course, you are not forced to choose one over the other. With protocols, you can implement graph operations for both built-in types as well as custom types. That way, you can choose between "quick and dirty" and "heavy but robust" depending on your application. Here "heavy" just means it might take a little more work to use graphs with functions implemented over Clojure collection interfaces. In other words, you might need to convert graphs to other types in some cases. This depends heavily on how much effort you put into implementing these interfaces for your custom graph types (and to the extent they make sense for graphs).
By the way, you cannot override that function with with-redefs or any related functionality. There is a hidden trick here: if you check the source code of the count function, you'll see an interesting inline attribute in its metadata:
(defn count
  "Returns the number of items in the collection. (count nil) returns
  0. Also works on strings, arrays, and Java Collections and Maps"
  {:inline (fn [x] `(. clojure.lang.RT (count ~x)))
   :added "1.0"}
  [coll] (clojure.lang.RT/count coll))
This attribute means a call to the function is compiled inline (here, straight to clojure.lang.RT/count) instead of going through the function itself. Most of the basic core functions have this attribute. That's why you cannot override count, get, int, nth and so forth.
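A quick sketch of the consequence, assuming the standard :inline behaviour (calls in operator position compile straight to clojure.lang.RT/count, while higher-order uses still go through the var):
(with-redefs [count (constantly 42)]
  (count [1 2 3]))
=> 3    ;; the direct call was inlined, so the redefinition is invisible
(with-redefs [count (constantly 42)]
  (apply count [[1 2 3]]))
=> 42   ;; count passed as a value does go through the var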
Sorry -- not for existing collection types you don't control.
You can define your own count that is aware of your needs and use that in your code, though unfortunately Clojure does not use a universal protocol for counting, so there is nothing for you to extend that would change count's behaviour on an existing collection.
If Counted were a protocol rather than an interface this would be easier, but it long predates protocols in the evolution of the language.
If you are making your own collection type, then you can of course implement count any way you want.

Why are Clojure vectors used to pass key-value pairs?

As a newcomer to Clojure, the distinction between a vector (array-like) and a map (key-value pairs) initially seemed clear to me.
However, in a lot of situations (such as the "let" special form and functions with keyword arguments) a vector is used to pass key-value pairs.
The source code for let even includes a check to ensure that the vector contains an even number of elements.
I really don't understand why vectors are used instead of maps. When I read about the collection types, I would expect maps to be the preferred way to store any information in key-value format.
Can anyone explain to me why vectors also seem to be the preferred tool to express pairs of keys and values?
The wonderful people at the Clojure IRC channel explained to me the primary reason: maps (hashes) are not ordered.
For example, the let form allows back-references which could break if the order of the arguments is not stable:
(let [a 1 b (inc a)] (+ a b))
The reasons ordered maps are not used:
they have no convenient literal;
vanilla Clojure has no ordered map anyway, except one that is ordered by sorting keys (which would be weird).
Thus, the need to keep arguments in order trumps the fact that they are key-value pairs.
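A small sketch of why a map literal would not be reliable here: insertion order survives only while the map remains a small array map.
(keys {:a 1 :b 2 :c 3})
=> (:a :b :c)
(keys (into {} (map #(vector % %) (range 10))))
;; => an order unrelated to insertion -- the map has been promoted to a hash map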

Why don't Clojure collections implement the ISeq interface directly?

Every collection in Clojure is said to be "seqable", but only lists and conses are actually seqs:
user> (seq? {:a 1 :b 2})
false
user> (seq? [1 2 3])
false
All other seq functions first convert a collection to a sequence and only then operate on it.
user> (class (rest {:a 1 :b 2}))
clojure.lang.PersistentArrayMap$Seq
I cannot do things like:
user> (:b (rest {:a 1 :b 2}))
nil
user> (:b (filter #(-> % val (= 1)) {:a 1 :b 1 :c 2}))
nil
and have to coerce back to the concrete data type. This looks like bad design to me, but most likely I just don't get it yet.
So, why don't Clojure collections implement the ISeq interface directly, and why don't the seq functions return an object of the same class as their input?
This has been discussed on the Clojure google group; see for example the thread map semantics from February of this year. I'll take the liberty of reusing some of the points I made in my message to that thread below while adding several new ones.
Before I go on to explain why I think the "separate seq" design is the correct one, I would like to point out that a natural solution for the situations where you'd really want to have an output similar to the input without being explicit about it exists in the form of the function fmap from the contrib library algo.generic. (I don't think it's a good idea to use it by default, however, for the same reasons for which the core library design is a good one.)
Overview
The key observation, I believe, is that the sequence operations like map, filter etc. conceptually divide into three separate concerns:
1. some way of iterating over their input;
2. applying a function to each element of the input;
3. producing an output.
Clearly 2. is unproblematic if we can deal with 1. and 3. So let's have a look at those.
Iteration
For 1., consider that the simplest and most performant way to iterate over a collection typically does not involve allocating intermediate results of the same abstract type as the collection. Mapping a function over a chunked seq over a vector is likely to be much more performant than mapping a function over a seq producing "view vectors" (using subvec) for each call to next; the latter, however, is the best we can do performance-wise for next on Clojure-style vectors (even in the presence of RRB trees, which are great when we need a proper subvector / vector slice operation to implement an interesting algorithm, but make traversals terrifyingly slow if we used them to implement next).
In Clojure, specialized seq types maintain traversal state and extra functionality such as (1) a node stack for sorted maps and sets (apart from better performance, this has better big-O complexity than traversals using dissoc / disj!), (2) current index + logic for wrapping leaf arrays in chunks for vectors, (3) a traversal "continuation" for hash maps. Traversing a collection through an object like this is simply faster than any attempt at traversing through subvec / dissoc / disj could be.
Suppose, however, that we're willing to accept the performance hit when mapping a function over a vector. Well, let's try filtering now:
(->> some-vector (map f) (filter p?))
There's a problem here -- there's no good way to remove elements from a vector. (Again, RRB trees could help in theory, but in practice all the RRB slicing and concatenating involved in producing "real vector" for filtering operations would absolutely destroy performance.)
Here's a similar problem. Consider this pipeline:
(->> some-sorted-set (filter p?) (map f) (take n))
Here we benefit from laziness (or rather, from the ability to stop filtering and mapping early; there's a point involving reducers to be made here, see below). Clearly take could be reordered with map, but not with filter.
The point is that if it's ok for filter to convert to seq implicitly, then it is also ok for map; and similar arguments can be made for other sequence functions. Once we've made the argument for all -- or nearly all -- of them, it becomes clear that it also makes sense for seq to return specialized seq objects.
Incidentally, filtering or mapping a function over a collection without producing a similar collection as a result is very useful. For example, often we care only about the result of reducing the sequence produced by a pipeline of transformations to some value or about calling a function for side effect at each element. For these scenarios, there is nothing whatsoever to be gained by maintaining the input type and quite a lot to be lost in performance.
Producing an output
As noted above, we do not always want to produce an output of the same type as the input. When we do, however, often the best way to do so is to do the equivalent of pouring a seq over the input into an empty output collection.
In fact, there is absolutely no way to do better for maps and sets. The fundamental reason is that for sets of cardinality greater than 1 there is no way to predict the cardinality of the output of mapping a function over a set, since the function can "glue together" (produce the same outputs for) arbitrary inputs.
Additionally, for sorted maps and sets there is no guarantee that the input set's comparator will be able to deal with outputs from an arbitrary function.
So, if in many cases there is no way to, say, map significantly better than by doing a seq and an into separately, and considering how both seq and into make useful primitives in their own right, Clojure makes the choice of exposing the useful primitives and letting users compose them. This lets us use map and into to produce a set from a set, while leaving us the freedom not to go on to the into stage when there is no value to be gained by producing a set (or another collection type, as the case may be).
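For example, producing a set from a set is just those two primitives composed:
(into #{} (map inc #{1 2 3}))
;; => #{2 3 4} -- seq over the set, map over the seq, pour into a fresh set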
Not all is seq; or, consider reducers
Some of the problems with using the collection types themselves when mapping, filtering etc. don't apply when using reducers.
The key difference between reducers and seqs is that clojure.core.reducers/map and friends produce only "descriptor" objects, which maintain information on what computations need to be performed in the event that the reducer is actually reduced. Thus, individual stages of the computation can be merged.
This allows us to do things like
(require '[clojure.core.reducers :as r])
(->> some-set (r/map f) (r/filter p?) (into #{}))
Of course we still need to be explicit about our (into #{}), but this is just a way of saying "the reducers pipeline ends here; please produce the result in the form of a set". We could also ask for a different collection type (a vector of results perhaps; note that mapping f over a set may well produce duplicate results, and we may in some situations wish to preserve them) or a scalar value (e.g. with (reduce + 0)).
Summary
The main points are these:
the fastest way to iterate over a collection typically doesn't involve producing intermediate results similar to the input;
seq uses the fastest way to iterate;
the best approach to transforming a set by mapping or filtering involves using a seq-style operation, because we want to iterate very fast while accumulating an output;
thus seq makes a great primitive;
map and filter, by dealing with seqs, may (depending on the scenario) avoid performance penalties that bring no upside, benefit from laziness etc., yet can still be used to produce a collection result with into;
thus they too make great primitives.
Some of these points may not apply to a statically typed language, but of course Clojure is dynamic. Additionally, when we do want a return value that matches the input type, we're simply forced to be explicit about it, and that, in itself, may be viewed as a good thing.
Sequences are a logical list abstraction. They provide access to a (stable) ordered sequence of values. They are implemented as views over collections (except for lists where the concrete interface matches the logical interface). The sequence (view) is a separate data structure that refers into the collection to provide the logical abstraction.
Sequence functions (map, filter, etc) take a "seqable" thing (something which can produce a sequence), call seq on it to produce the sequence, and then operate on that sequence, returning a new sequence. It is up to you whether you need to or how to re-collect that sequence back into a concrete collection. While vectors and lists are ordered, sets and maps are not and thus sequences over these data structures must compute and retain the order outside the collection.
Specialized functions like mapv, filterv, reduce-kv allow you to stay "in the collection" when you know you want the operation to return a collection at the end instead of sequence.
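A couple of sketches of those collection-preserving variants:
(mapv inc [1 2 3])
=> [2 3 4]
(reduce-kv (fn [m k v] (assoc m k (inc v))) {} {:a 1 :b 2})
=> {:a 2, :b 3}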
Seqs are ordered structures, whereas maps and sets are unordered. Two maps that are equal in value may have a different internal ordering. For example:
user=> (seq (array-map :a 1 :b 2))
([:a 1] [:b 2])
user=> (seq (array-map :b 2 :a 1))
([:b 2] [:a 1])
It makes no sense to ask for the rest of a map, because it's not a sequential structure. The same goes for a set.
So what about vectors? They're sequentially ordered, so we could potentially map across a vector, and indeed there is such a function: mapv.
You may well ask: why is this not implicit? If I pass a vector to map, why doesn't it return a vector?
Well, first that would mean making an exception for ordered structures like vectors, and Clojure isn't big on making exceptions.
But more importantly you'd lose one of the most useful properties of seqs: laziness. Chaining together seq functions, such as map and filter is a very common operation, and without laziness this would be much less performant and far more memory-intensive.
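For instance, laziness lets a chained pipeline run over an infinite input and stop as soon as enough elements are realized:
(take 5 (filter even? (map #(* % %) (range))))
=> (0 4 16 36 64)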
The collection classes follow a factory pattern: instead of implementing ISeq they implement Seqable, i.e. you can create an ISeq from the collection, but the collection itself is not an ISeq.
Even if these collections implemented ISeq directly, I am not sure how that would solve your problem of having general-purpose sequence functions return an object of the same type as the input: those functions are supposed to work on ISeq, and they have no idea which object gave them that ISeq.
Example in Java:
interface ISeq {
    ....
}

class A implements ISeq {
}

class B implements ISeq {
}

static class Helpers {
    /*
      Filter can only work with ISeq; that's what makes it general purpose.
      There is no way it could return A or B objects.
    */
    public static ISeq filter(ISeq coll, ...) { }
    ...
}

In Clojure, when should trees of heterogeneous node types be represented using records or vectors?

Which is more idiomatic Clojure practice for representing a tree made up of different node types:
A. building trees out of several different types of records, that one defines using deftype or defrecord:
(defrecord node_a [left right])
(defrecord node_b [left right])
(defrecord leaf [])
(def my-tree (node_a. (node_b. (leaf.) (leaf.)) (leaf.)))
B. building trees out of vectors, with keywords designating the types:
(def my-tree [:node-a [:node-b :leaf :leaf] :leaf])
Most Clojure code that I see seems to favor the usage of the general-purpose data structures (vectors, maps, etc.) rather than datatypes or records. Hiccup, to take one example, represents HTML very nicely using the vector + keyword approach.
When should we prefer one style over the other?
You can put as many elements into a vector as you want. A record has a set number of fields. If you want to constrain your nodes to only have N sub-nodes, records might be good, e.g. when making a binary tree, where a node has to have exactly a left and a right. But for something like HTML or XML, you probably want to support arbitrary numbers of sub-nodes.
Using vectors and keywords means that "extending" the set of supported node types is as simple as putting a new keyword into the vector. [:frob "foo"] is OK in Hiccup even if its author never heard of frobbing. Using records, you'd potentially have to define a new record for every node type. But then you get the benefit of catching typos and verifying subnodes. [:strnog "some bold text?"] isn't going to be caught by Hiccup, but (Strnog. "foo") would be a compile-time error.
Vectors being one of Clojure's basic data types, you can use Clojure's built-in functions to manipulate them. Want to extend your tree? Just conj onto it, or update-in, or whatever. You can build up your tree incrementally this way. With records, you're probably stuck with constructor calls, or else you have to write a ton of wrapper functions for the constructors.
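A small sketch of that incremental growth, using the my-tree from option B above:
(conj my-tree :leaf)
=> [:node-a [:node-b :leaf :leaf] :leaf :leaf]
(update-in my-tree [1] conj :leaf)
=> [:node-a [:node-b :leaf :leaf :leaf] :leaf]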
Seems like this partly boils down to an argument of dynamic vs. static. Personally, I would go the dynamic (vector + keyword) route unless there was a specific need for the benefits of using records. It's probably easier to code that way, and it's more flexible for the user, at the cost of being easier for the user to end up making a mess. But Clojure users are likely used to having to handle dangerous weapons on a regular basis. Clojure being largely a dynamic language, staying dynamic is often the right thing to do.
This is a good question. I think both are appropriate for different kinds of problems. Nested vectors are a good solution if each node can contain a variable set of information - in particular templating systems are going to work well. Records are a good solution for a smallish number of fixed node types where nesting is far more constrained.
We do a lot of work with heterogeneous trees of records. Each node represents one of a handful of well-known types, each with a different set of known fixed keys. The reason records are better in this case is that you can pick the data out of the node by key which is O(1) (really a Java method call which is very fast), not O(n) (where you have to look through the node contents) and also generally easier to access.
Records in 1.2 are imho not quite "finished" but it's pretty easy to build that stuff yourself. We have a defrecord2 that adds constructor functions (new-foo), field validation, print support, pprint support, tree walk/edit support via zippers, etc.
An example of where we use this is to represent ASTs or execution plans where nodes might be things like Join, Sort, etc.
Vectors are going to be better for creating stuff like strings where an arbitrary number of things can be put in each node. If you can stuff 1+ <p>s inside a <div>, then you can't create a record that contains a :p field - that just doesn't make any sense. That's a case where vectors are far more flexible and idiomatic.