Why does Clojure's group-by not always maintain order?

Why does (group-by identity (range 1 50)) return results like
{32 [32], 1 [1], 33 [33], 2 [2], 34 [34], 3 [3], 35 [35]...
Is it multi-threading related? Is there any way around it?
...and does it break its contract?
Returns a map of the elements of coll keyed by the result of f on each element. The value at each key will be a vector of the corresponding elements, in the order they appeared in coll.

Try entering (type (group-by identity (range 1 50))) in your REPL. You can see that the result is actually a hash map (of class clojure.lang.PersistentHashMap). This means that it's unordered: in principle the REPL could output the printed map literal in any order of key/value pairs.
The actual reason why it's printed the way it's printed has to do with Clojure's implementation of a hash map -- in terms of data structures, it's actually a wide tree where each node can have up to 32 children (hence the initial 32 in your output; recall that Clojure vectors and maps are often cited to have lookup cost of O(log32 N)). This blog article has a good summary.
And no, it doesn't violate the contract of group-by. The contract only specifies the ordering of the map's values' elements, not the ordering of the map per se.
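To make the distinction concrete, here is a small sketch. The names grouped, by-id and sorted-by-id are only for illustration; pouring the result into a sorted map is one possible workaround when you need a predictable key order, not something group-by does for you.

```clojure
;; Each value vector keeps the order in which elements appeared in the input.
(def grouped (group-by odd? [1 2 3 4 5 6]))
(get grouped true)   ;=> [1 3 5]
(get grouped false)  ;=> [2 4 6]

;; The example from the question: the key order of the printed map is unspecified.
(def by-id (group-by identity (range 1 50)))

;; Pouring the result into a sorted map gives a predictable key order.
(def sorted-by-id (into (sorted-map) by-id))
(take 3 sorted-by-id)  ;=> ([1 [1]] [2 [2]] [3 [3]])
```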

Related

Why clojure collections don't implement ISeq interface directly?

Every collection in Clojure is said to be "seqable", but only list and cons are actually seqs:
user> (seq? {:a 1 :b 2})
false
user> (seq? [1 2 3])
false
All other seq functions first convert a collection to a sequence and only then operate on it.
user> (class (rest {:a 1 :b 2}))
clojure.lang.PersistentArrayMap$Seq
I cannot do things like:
user> (:b (rest {:a 1 :b 2}))
nil
user> (:b (filter #(-> % val (= 1)) {:a 1 :b 1 :c 2}))
nil
and have to coerce back to concrete data type. This looks like bad design to me, but most likely I just don't get it as yet.
So, why don't Clojure collections implement the ISeq interface directly, and why don't all seq functions return an object of the same class as the input object?
This has been discussed on the Clojure google group; see for example the thread map semantics from February of this year. I'll take the liberty of reusing some of the points I made in my message to that thread below while adding several new ones.
Before I go on to explain why I think the "separate seq" design is the correct one, I would like to point out that a natural solution for the situations where you'd really want to have an output similar to the input without being explicit about it exists in the form of the function fmap from the contrib library algo.generic. (I don't think it's a good idea to use it by default, however, for the same reasons for which the core library design is a good one.)
Overview
The key observation, I believe, is that the sequence operations like map, filter etc. conceptually divide into three separate concerns:
1. some way of iterating over their input;
2. applying a function to each element of the input;
3. producing an output.
Clearly 2. is unproblematic if we can deal with 1. and 3. So let's have a look at those.
Iteration
For 1., consider that the simplest and most performant way to iterate over a collection typically does not involve allocating intermediate results of the same abstract type as the collection. Mapping a function over a chunked seq over a vector is likely to be much more performant than mapping a function over a seq producing "view vectors" (using subvec) for each call to next; the latter, however, is the best we can do performance-wise for next on Clojure-style vectors (even in the presence of RRB trees, which are great when we need a proper subvector / vector slice operation to implement an interesting algorithm, but would make traversals terrifyingly slow if we used them to implement next).
In Clojure, specialized seq types maintain traversal state and extra functionality such as (1) a node stack for sorted maps and sets (apart from better performance, this has better big-O complexity than traversals using dissoc / disj!), (2) current index + logic for wrapping leaf arrays in chunks for vectors, (3) a traversal "continuation" for hash maps. Traversing a collection through an object like this is simply faster than any attempt at traversing through subvec / dissoc / disj could be.
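A quick REPL-style sketch of the "separate seq object" idea (the names v and s are just for illustration): the vector itself is not a seq, while the object returned by seq is a seq and, for vectors, a chunked one.

```clojure
(def v (vec (range 100)))
(def s (seq v))

(seq? v)            ;=> false, the vector itself is not a seq
(seq? s)            ;=> true, the traversal object is
(chunked-seq? s)    ;=> true, vectors are traversed chunk by chunk
(vector? (rest s))  ;=> false, next/rest never build "view vectors"

;; A sorted set is traversed in comparator order by its own seq type.
(seq (sorted-set 3 1 2))  ;=> (1 2 3)
```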
Suppose, however, that we're willing to accept the performance hit when mapping a function over a vector. Well, let's try filtering now:
(->> some-vector (map f) (filter p?))
There's a problem here -- there's no good way to remove elements from a vector. (Again, RRB trees could help in theory, but in practice all the RRB slicing and concatenating involved in producing a "real vector" for filtering operations would absolutely destroy performance.)
Here's a similar problem. Consider this pipeline:
(->> some-sorted-set (filter p?) (map f) (take n))
Here we benefit from laziness (or rather, from the ability to stop filtering and mapping early; there's a point involving reducers to be made here, see below). Clearly take could be reordered with map, but not with filter.
The point is that if it's ok for filter to convert to seq implicitly, then it is also ok for map; and similar arguments can be made for other sequence functions. Once we've made the argument for all -- or nearly all -- of them, it becomes clear that it also makes sense for seq to return specialized seq objects.
Incidentally, filtering or mapping a function over a collection without producing a similar collection as a result is very useful. For example, often we care only about the result of reducing the sequence produced by a pipeline of transformations to some value or about calling a function for side effect at each element. For these scenarios, there is nothing whatsoever to be gained by maintaining the input type and quite a lot to be lost in performance.
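As a minimal sketch of both scenarios (xs, total and seen are illustrative names): the pipeline below is reduced to a single number, and a second one is consumed purely for side effects, so there is never a reason to rebuild a vector.

```clojure
(def xs [1 2 3 4 5 6])

;; Reduce the transformed sequence to one value: (3 5 7) sums to 15.
(def total (reduce + (map inc (filter even? xs))))

;; Call a function for side effect at each element; nothing is collected.
(def seen (atom []))
(run! #(swap! seen conj %) (filter odd? xs))
@seen  ;=> [1 3 5]
```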
Producing an output
As noted above, we do not always want to produce an output of the same type as the input. When we do, however, often the best way to do so is to do the equivalent of pouring a seq over the input into an empty output collection.
In fact, there is absolutely no way to do better for maps and sets. The fundamental reason is that for sets of cardinality greater than 1 there is no way to predict the cardinality of the output of mapping a function over a set, since the function can "glue together" (produce the same outputs for) arbitrary inputs.
Additionally, for sorted maps and sets there is no guarantee that the input set's comparator will be able to deal with outputs from an arbitrary function.
So, if in many cases there is no way to, say, map significantly better than by doing a seq and an into separately, and considering how both seq and into make useful primitives in their own right, Clojure makes the choice of exposing the useful primitives and letting users compose them. This lets us map and into to produce a set from a set, while leaving us the freedom to not go on to the into stage when there is no value to be gained by producing a set (or another collection type, as the case may be).
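A small sketch of the "seq, then into" composition, using made-up data. It also shows the cardinality point from above, and how the filter example from the question can get back to a map when that is actually wanted.

```clojure
(def s #{0 1 2 3})

;; Mapping can "glue together" inputs: four elements in, two out.
(def halves (into #{} (map #(quot % 2) s)))  ;=> #{0 1}

;; (empty s) yields an empty collection of the same kind as the input.
(def evens (into (empty s) (filter even? s)))  ;=> #{0 2}

;; The question's filter example, poured back into a map explicitly.
(def ones (into {} (filter #(-> % val (= 1)) {:a 1 :b 1 :c 2})))
(:b ones)  ;=> 1
```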
Not all is seq; or, consider reducers
Some of the problems with using the collection types themselves when mapping, filtering etc. don't apply when using reducers.
The key difference between reducers and seqs is that clojure.core.reducers/map and friends produce only "descriptor" objects, which record what computations need to be performed if and when the reducer is actually reduced. Thus, individual stages of the computation can be merged.
This allows us to do things like
(require '[clojure.core.reducers :as r])
(->> some-set (r/map f) (r/filter p?) (into #{}))
Of course we still need to be explicit about our (into #{}), but this is just a way of saying "the reducers pipeline ends here; please produce the result in the form of a set". We could also ask for a different collection type (a vector of results perhaps; note that mapping f over a set may well produce duplicate results and we may in some situations wish to preserve them) or a scalar value ((reduce + 0)).
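Here is the same idea with concrete inputs, as a sketch: some-set, inc, even? and the halving function stand in for the unspecified some-set, f and p? above.

```clojure
(require '[clojure.core.reducers :as r])

(def some-set #{1 2 3 4 5 6})

;; End the pipeline as a set.
(def as-set (->> some-set (r/map inc) (r/filter even?) (into #{})))  ;=> #{2 4 6}

;; End it as a vector instead, keeping duplicates (element order is unspecified).
(def as-vec (->> some-set (r/map #(quot % 2)) (into [])))

;; Or reduce straight to a scalar: 2 + 3 + ... + 7 = 27.
(def total (->> some-set (r/map inc) (reduce + 0)))
```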
Summary
The main points are these:
the fastest way to iterate over a collection typically doesn't involve producing intermediate results similar to the input;
seq uses the fastest way to iterate;
the best approach to transforming a set by mapping or filtering involves using a seq-style operation, because we want to iterate very fast while accumulating an output;
thus seq makes a great primitive;
map and filter, in their choice to deal with seqs, depending on the scenario, may avoid performance penalties without upsides, benefit from laziness etc., yet can still be used to produce a collection result with into;
thus they too make great primitives.
Some of these points may not apply to a statically typed language, but of course Clojure is dynamic. Additionally, when we do want a return value that matches the input type, we're simply forced to be explicit about it and that, in itself, may be viewed as a good thing.
Sequences are a logical list abstraction. They provide access to a (stable) ordered sequence of values. They are implemented as views over collections (except for lists where the concrete interface matches the logical interface). The sequence (view) is a separate data structure that refers into the collection to provide the logical abstraction.
Sequence functions (map, filter, etc) take a "seqable" thing (something which can produce a sequence), call seq on it to produce the sequence, and then operate on that sequence, returning a new sequence. It is up to you whether you need to or how to re-collect that sequence back into a concrete collection. While vectors and lists are ordered, sets and maps are not and thus sequences over these data structures must compute and retain the order outside the collection.
Specialized functions like mapv, filterv, reduce-kv allow you to stay "in the collection" when you know you want the operation to return a collection at the end instead of sequence.
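For instance (a short sketch with made-up data):

```clojure
;; mapv and filterv return vectors rather than lazy seqs.
(def incremented (mapv inc [1 2 3]))         ;=> [2 3 4]
(def odds        (filterv odd? [1 2 3 4]))   ;=> [1 3]

;; reduce-kv walks a map's keys and values directly, here building a new map.
(def bumped
  (reduce-kv (fn [m k v] (assoc m k (inc v)))
             {}
             {:a 1 :b 2}))                   ;=> {:a 2, :b 3}
```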
Seqs are ordered structures, whereas maps and sets are unordered. Two maps that are equal in value may have a different internal ordering. For example:
user=> (seq (array-map :a 1 :b 2))
([:a 1] [:b 2])
user=> (seq (array-map :b 2 :a 1))
([:b 2] [:a 1])
It makes no sense to ask for the rest of a map, because it's not a sequential structure. The same goes for a set.
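Continuing the example above as a sketch (m1 and m2 are just names for the two array maps): the maps compare equal as values even though the seqs over them do not.

```clojure
(def m1 (array-map :a 1 :b 2))
(def m2 (array-map :b 2 :a 1))

(= m1 m2)              ;=> true, maps are compared as unordered values
(= (seq m1) (seq m2))  ;=> false, seqs are compared in order
```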
So what about vectors? They're sequentially ordered, so we could potentially map across a vector, and indeed there is such a function: mapv.
You may well ask: why is this not implicit? If I pass a vector to map, why doesn't it return a vector?
Well, first that would mean making an exception for ordered structures like vectors, and Clojure isn't big on making exceptions.
But more importantly you'd lose one of the most useful properties of seqs: laziness. Chaining together seq functions, such as map and filter is a very common operation, and without laziness this would be much less performant and far more memory-intensive.
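A small sketch of what laziness buys you: the pipeline below starts from an infinite sequence, which would be impossible if every stage had to produce a complete vector. The names are illustrative.

```clojure
(def first-squares
  (->> (range)          ; infinite sequence 0, 1, 2, ...
       (filter even?)
       (map #(* % %))
       (take 3)))       ; only as much as needed is ever computed

first-squares                        ;=> (0 4 16)

;; If a vector is wanted at the end, ask for it explicitly.
(def as-vector (vec first-squares))  ;=> [0 4 16]
```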
The collection classes follow a factory pattern: instead of implementing ISeq, they implement Seqable. That is, you can create an ISeq from the collection, but the collection itself is not an ISeq.
Now, even if these collections implemented ISeq directly, I am not sure how that would solve your problem of having general-purpose sequence functions return the original type of object. These general-purpose functions are written against ISeq; they have no idea which kind of object gave them the ISeq.
Example in Java:
interface ISeq {
    ....
}
class A implements ISeq {
}
class B implements ISeq {
}
static class Helpers {
    /*
      Filter can only work with ISeq, that's what makes it general purpose.
      There is no way it could return A or B objects.
    */
    public static ISeq filter(ISeq coll, ...) { }
    ...
}
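The Seqable / ISeq split can also be checked from the Clojure side. This is a sketch for Clojure on the JVM, using the interfaces in clojure.lang; v is an illustrative name.

```clojure
(def v [1 2 3])

;; Collections are Seqable: they can produce a seq...
(instance? clojure.lang.Seqable v)               ;=> true
(instance? clojure.lang.Seqable {:a 1})          ;=> true
(instance? clojure.lang.Seqable #{1 2})          ;=> true

;; ...but they are not themselves ISeq.
(instance? clojure.lang.ISeq v)                  ;=> false

;; The object returned by seq is an ISeq, and so is a list.
(instance? clojure.lang.ISeq (seq v))            ;=> true
(instance? clojure.lang.ISeq '(1 2 3))           ;=> true
```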

Why does (seq #{3 1 22 44}) come out as (1 3 44 22) in Clojure?

How does it work?
(seq #{3 1 22 44})
And why is the order
(1 3 44 22)
Because the set data structure is, by definition, unordered: http://en.wikipedia.org/wiki/Set_(data_structure)
To be more precise, Clojure's built-in set (which #{blah blah blah} gives you) is a hash set -- that is, a set backed by a hash table (http://en.wikipedia.org/wiki/Hash_tables). It provides you with the following guarantees:
Uniqueness of every element (no duplicates allowed).
Effectively O(1) performance for insertion and containment checks (strictly O(log32 N) in Clojure's implementation).
Iteration works -- calling seq will give you every element in the set, but in an undefined order.
Undefined order, here, means that the iteration order depends on the elements you inserted in the set, their number, the order in which you inserted them, all the other operations you may have tried on that set before, and various other implementation details that might change from one language version to the next (and even between implementations -- you might, and probably will, get different results in Clojure, Clojure running on a 64-bit JVM, or ClojureScript).
The important thing to take away is, if you're writing code that works with sets (or maps), never make it depend on any notion of order in said sets/maps. It'll break.
#{3 1 22 44} is a set in Clojure, which is not an ordered sequence.
Thus when you do seq on a set, the order of the resulting seq is arbitrary (but will be the same every time you call seq on this instance).
If you want the set to be sorted, you can create a sorted set with sorted-set.
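For example (a minimal sketch; s and sorted are illustrative names):

```clojure
(def s #{3 1 22 44})

;; Build a sorted set from the same elements.
(def sorted (into (sorted-set) s))
(seq sorted)          ;=> (1 3 22 44)

;; Or leave the set alone and sort only when a sequence is needed.
(sort s)              ;=> (1 3 22 44)

;; The order of a plain hash set is arbitrary but stable for a given instance.
(= (seq s) (seq s))   ;=> true
```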

in Clojure, is a vector a specific hashmap?

On "Programming Clojure", there is an example using get function on a vector:
(get [:a :b :c] 1)
-> :b
I called (doc get) and it looks like the get function takes a hashmap as its argument, not a vector, so I wonder if a vector is some kind of hashmap. I remember a hashmap can take a key and return the value matching it, so I tried this to see whether a vector can do the same thing with an integer index:
([1 2 3 4] 1)
-> 2
It did return value 2, which is at index 1 in [1 2 3 4].
Does this mean a vector is a hashmap whose key-value pairs are index-value pairs?
No, the underlying implementation is different.
That being said, since logically vectors do map indices to elements, they are associative structures in Clojure and can be used with get, contains? and assoc (though for assoc only indices from 0 to 1 past the end of the vector can be used). They cannot be used with dissoc though -- that's a "real map" operation.
Also, vectors act differently to maps when used as functions: calling a map as a function is equivalent to using it with get, while calling a vector is equivalent to using nth. The difference is that nth throws an exception on index-out-of-bounds (as well as arguments which could not possibly be indices, such as negative numbers or non-numbers), whereas get returns nil.
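The differences described above can be seen in a short sketch (v and call-result are illustrative names; the exception class is the one thrown on the JVM):

```clojure
(def v [:a :b :c])

(associative? v)   ;=> true, vectors map indices to elements
(map? v)           ;=> false, but they are not maps

(get v 1)          ;=> :b
(get v 10)         ;=> nil, get never throws on a missing index
(contains? v 2)    ;=> true, contains? checks indices, not values
(contains? v 3)    ;=> false

;; assoc may use an index one past the end, which appends.
(assoc v 3 :d)     ;=> [:a :b :c :d]

;; Calling the vector as a function behaves like nth and throws when out of bounds.
(def call-result
  (try (v 10)
       (catch IndexOutOfBoundsException _ :out-of-bounds)))
call-result        ;=> :out-of-bounds
```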

how to model the below transformation

I have a hashmap of data with the key a string description and value an integer weight.
{:a 2 :b 3 ......}
I need to transform the hash into a vector of vectors. Each internal vector contains the map entries.
[[[:a 2][:b 3]...][......]]
Each internal vector is built based upon some rules. For example, the sum of all weights should not exceed a certain value.
Normally this seems to be a good case for a reduce where a hash is transformed into a vector of vectors of map entries. However I may need to iterate over the hash more than once as I may need to reshuffle the entries between the internal vectors so that all of the vectors sum up to a certain num.
Any suggestions on how the problem should be modelled?
Well, for starters, calling seq on a Clojure map already gives you a sequence of key/value entries, which behave like two-element vectors. So no reduce is needed:
=> (for [e {:a 1 :b 2}] e)
([:a 1] [:b 2])
Instead of thinking of it as "iterating," you should take the approach of defining a function that takes your input vectors and returns a new sequence with the adjustment. Recursively call this function until the sum you need is reached.
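As a starting point, here is one possible sketch of such a function. It is a simple greedy first pass, not a full solution to the reshuffling problem; pack-entries and max-weight are hypothetical names, and sorting by weight first is an assumption made only to get a deterministic result.

```clojure
(defn pack-entries
  "Hypothetical helper: greedily packs the entries of map m into vectors
  whose weights sum to at most max-weight. An entry heavier than
  max-weight ends up in a group of its own."
  [max-weight m]
  (reduce (fn [groups [_ w :as entry]]
            (let [current (peek groups)                    ; last group, or nil
                  total   (reduce + (map second current))] ; 0 when there is no group yet
              (if (and current (<= (+ total w) max-weight))
                (conj (pop groups) (conj current entry))   ; fits: extend the last group
                (conj groups [entry]))))                   ; otherwise start a new group
          []
          (sort-by val m)))                                ; deterministic order by weight

(pack-entries 5 {:a 2 :b 3 :c 4})
;=> [[[:a 2] [:b 3]] [[:c 4]]]
```

A second pass that rebalances groups could take this vector of vectors as input and return a new one, which keeps each step a plain function of data.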

Fast insert into the beginning and end of a clojure seq?

In clojure lists grow from the left and vectors grow from the right, so:
user> (conj '(1 2 3) 4)
(4 1 2 3)
user> (conj [1 2 3] 4)
[1 2 3 4]
What's the most efficient method of inserting values both into the front and the back of a sequence?
You need a different data structure to support fast inserting at both start and end. See https://github.com/clojure/data.finger-tree
As I understand it, a sequence is just a generic data structure so it depends on the specific implementation you are working with.
For a vector, appending at the end with conj takes effectively constant time, but inserting at the front is O(n) because the whole vector has to be rebuilt.
For a list, I would expect inserting at the front of the list with a cons operation to take constant time, but inserting to the back of the list will take O(n) since you have to traverse the entire structure to get to the end.
There are, of course, a lot of other data structures that can theoretically be a sequence (e.g. trees), and they will have their own complexity characteristics.
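With only the built-in collections, the trade-offs look roughly like this (a sketch; the names are illustrative):

```clojure
;; Cheap ends: front of a list, back of a vector.
(def front-of-list  (conj '(1 2 3) 0))   ;=> (0 1 2 3)
(def back-of-vector (conj [1 2 3] 4))    ;=> [1 2 3 4]

;; cons puts an element in front of anything seqable in constant time,
;; but the result is a seq, not a vector.
(def consed (cons 0 [1 2 3]))            ;=> (0 1 2 3)
(vector? consed)                         ;=> false

;; Getting a real vector with a new first element means rebuilding it, O(n).
(def rebuilt (into [0] [1 2 3]))         ;=> [0 1 2 3]
```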