Find maximum element in Clojure sorted-set - clojure

I am looking for an idiomatic, O(1), way to extract the maximum element of a sorted-set of integers, in Clojure. Do I have to cast the set into a seq?

With Clojure's built-in sorted sets, this is O(log n):
(first (rseq the-set))
With data.avl sets, the above also works, and you can also use nth to access elements by rank in O(log n) time:
(nth the-set (dec (count the-set)))
None of these data structures offers O(1) access to extreme elements, but you could certainly use [the-set max-element] bundles (possibly represented as records for cleaner manipulation) with a custom API if you expect to perform frequent repeated accesses to the maximum element.
If you only care about fast access to the maximum element and not about ordered traversals, you could actually use the bundle approach with "unsorted" sets – either built-in hash sets or data.int-map sets. That's not useful for repeated disj of maximum elements, of course, only for maintaining upper bound information for a fundamentally unsorted set (unless you keep a stack of max elements in the bundle, at which point it definitely makes sense to simply use sorted sets and get all relevant operations for free).

If you take into account the construction of the set, (into #{} my-coll) & (max my-set) is faster, than (into (sorted-seq) my-coll) & (last my-sorted-set).

Related

Loops & state management in test.check

With the introduction of Spec, I try to write test.check generators for all of my functions. This is fine for simple data structures, but tends to become difficult with data structures that have parts that depend on each other. In other words, some state management within the generators is then required.
It would already help enormously to have generator-equivalents of Clojure's loop/recur or reduce, so that a value produced in one iteration can be stored in some aggregated value that is then accessible in a subsequent iteration.
One simple example where this would be required, is to write a generator for splitting up a collection into exactly X partitions, with each partition having between zero and Y elements, and where the elements are then randomly assigned to any of the partitions. (Note that test.chuck's partition function does not allow to specify X or Y).
If you write this generator by looping through the collection, then this would require access to the partitions filled up during previous iterations, to avoid exceeding Y.
Does anybody have any ideas? Partial solutions I have found:
test.check's let and bind allow you to generate a value and then reuse that value later on, but they do not allow iterations.
You can iterate through a collection of previously generated values with a combination of the tuple and bindfunctions, but these iterations do not have access to the values generated during previous iterations.
(defn bind-each [k coll] (apply tcg/tuple (map (fn [x] (tcg/bind (tcg/return x) k)) coll))
You can use atoms (or volatiles) to store & access values generated during previous iterations. This works, but is very un-Clojure, in particular because you need to reset! the atom/volatile before the generator is returned, to avoid that their contents would get reused in the next call of the generator.
Generators are monad-like due to their bind and return functions, which hints at the use of a monad library such as Cats in combination with a State monad. However, the State monad was removed in Cats 2.0 (because it was allegedly not a good fit for Clojure), while other support libraries I am aware of do not have formal Clojurescript support. Furthermore, when implementing a State monad in his own library, Jim Duey — one of Clojure's monad experts — seems to warn that the use of the State monad is not compatible with test.check's shrinking (see the bottom of http://www.clojure.net/2015/09/11/Extending-Generative-Testing/), which significantly reduces the merits of using test.check.
You can accomplish the iteration you're describing by combining gen/let (or equivalently gen/bind) with explicit recursion:
(defn make-foo-generator
[state]
(if (good-enough? state)
(gen/return state)
(gen/let [state' (gen-next-step state)]
(make-foo-generator state'))))
However, it's worth trying to avoid this pattern if possible, because each use of let/bind undermines the shrinking process. Sometimes it's possible to reorganize the generator using gen/fmap. For example, to partition a collection into a sequence of X subsets (which I realize is not exactly what your example was, but I think it could be tweaked to fit), you could do something like this:
(defn partition
[coll subset-count]
(gen/let [idxs (gen/vector (gen/choose 0 (dec subset-count))
(count coll))]
(->> (map vector coll idxs)
(group-by second)
(sort-by key)
(map (fn [[_ pairs]] (map first pairs))))))

IndexedSeq VS. PersistentVector

Can somebody explain me, the difference between 'IndexedSeq' and 'PersistentVector'?
I bumped into this, when updating a vector in my data structure via 'rest'. Here's a REPL excerpt that shows the transformation.
=> (def xs [1 2 3])
...
(type xs)
cljs.core/PersistentVector
=> (def xs2 (rest xs))
...
(type xs2)
cljs.core/IndexedSeq
I'm holding a list in an app-state atom, which needs to be shifted once in a while, so the first item must disappear. Would be really cool, if anybody could give me a hint about which data structure might be preferable here in terms of performance.
Sometimes elements get pushed to the end of the list as well, so I guess it's a LIFO mechanism that I'm creating here.
From your last paragraph, it sounds like you're using this as a stack. Taken together, pop, peek, and conj form a stack interface that can be used with either lists or vectors (working on the front of a list or the end of a vector). I would use those.
If you're just using those functions, I don't think there should be any significant performance differences (all three functions should be constant time).
looking at the superinterfaces here: http://static.javadoc.io/org.clojure/clojure/1.7.0/clojure/lang/IndexedSeq.html
I can guess, it is not the most efficient thing here, since it is just a seq, with no guaranteed constant-time access to the nth member. To ensure the vector semantics you should probably use subvec to remove the first element.
In general, if you don't do random access to elements, in terms of performance it should be enough to use concat to add element to the end (as it produces a lazy sequence, won't consume the whole collection, and should be done in a constant time) and rest to remove the first element (as it is also done in a constant time), to make FIFO stack (which is what you do). (it's not the best variant still, since it may lead to stack owerflow, if you do alot of push without realizing the sequence.
But sure it's better to use vectors. So the combination of conj , first, and subvec should be your choice.

Why clojure collections don't implement ISeq interface directly?

Every collection in clojure is said to be "sequable" but only list and cons are actually seqs:
user> (seq? {:a 1 :b 2})
false
user> (seq? [1 2 3])
false
All other seq functions first convert a collection to a sequence and only then operate on it.
user> (class (rest {:a 1 :b 2}))
clojure.lang.PersistentArrayMap$Seq
I cannot do things like:
user> (:b (rest {:a 1 :b 2}))
nil
user> (:b (filter #(-> % val (= 1)) {:a 1 :b 1 :c 2}))
nil
and have to coerce back to concrete data type. This looks like bad design to me, but most likely I just don't get it as yet.
So, why clojure collections don't implement ISeq interface directly and all seq functions don't return an object of the same class as the input object?
This has been discussed on the Clojure google group; see for example the thread map semantics from February of this year. I'll take the liberty of reusing some of the points I made in my message to that thread below while adding several new ones.
Before I go on to explain why I think the "separate seq" design is the correct one, I would like to point out that a natural solution for the situations where you'd really want to have an output similar to the input without being explicit about it exists in the form of the function fmap from the contrib library algo.generic. (I don't think it's a good idea to use it by default, however, for the same reasons for which the core library design is a good one.)
Overview
The key observation, I believe, is that the sequence operations like map, filter etc. conceptually divide into three separate concerns:
some way of iterating over their input;
applying a function to each element of the input;
producing an output.
Clearly 2. is unproblematic if we can deal with 1. and 3. So let's have a look at those.
Iteration
For 1., consider that the simplest and most performant way to iterate over a collection typically does not involve allocating intermediate results of the same abstract type as the collection. Mapping a function over a chunked seq over a vector is likely to be much more performant than mapping a function over a seq producing "view vectors" (using subvec) for each call to next; the latter, however, is the best we can do performance-wise for next on Clojure-style vectors (even in the presence of RRB trees, which are great when we need a proper subvector / vector slice operation to implement an interesting algorithm, but make traversals terrifying slow if we used them to implement next).
In Clojure, specialized seq types maintain traversal state and extra functionality such as (1) a node stack for sorted maps and sets (apart from better performance, this has better big-O complexity than traversals using dissoc / disj!), (2) current index + logic for wrapping leaf arrays in chunks for vectors, (3) a traversal "continuation" for hash maps. Traversing a collection through an object like this is simply faster than any attempt at traversing through subvec / dissoc / disj could be.
Suppose, however, that we're willing to accept the performance hit when mapping a function over a vector. Well, let's try filtering now:
(->> some-vector (map f) (filter p?))
There's a problem here -- there's no good way to remove elements from a vector. (Again, RRB trees could help in theory, but in practice all the RRB slicing and concatenating involved in producing "real vector" for filtering operations would absolutely destroy performance.)
Here's a similar problem. Consider this pipeline:
(->> some-sorted-set (filter p?) (map f) (take n))
Here we benefit from laziness (or rather, from the ability to stop filtering and mapping early; there's a point involving reducers to be made here, see below). Clearly take could be reordered with map, but not with filter.
The point is that if it's ok for filter to convert to seq implicitly, then it is also ok for map; and similar arguments can be made for other sequence functions. Once we've made the argument for all -- or nearly all -- of them, it becomes clear that it also makes sense for seq to return specialized seq objects.
Incidentally, filtering or mapping a function over a collection without producing a similar collection as a result is very useful. For example, often we care only about the result of reducing the sequence produced by a pipeline of transformations to some value or about calling a function for side effect at each element. For these scenarios, there is nothing whatsoever to be gained by maintaining the input type and quite a lot to be lost in performance.
Producing an output
As noted above, we do not always want to produce an output of the same type as the input. When we do, however, often the best way to do so is to do the equivalent of pouring a seq over the input into an empty output collection.
In fact, there is absolutely no way to do better for maps and sets. The fundamental reason is that for sets of cardinality greater than 1 there is no way to predict the cardinality of the output of mapping a function over a set, since the function can "glue together" (produce the same outputs for) arbitrary inputs.
Additionally, for sorted maps and sets there is no guarantee that the input set's comparator will be able to deal with outputs from an arbitrary function.
So, if in many cases there is no way to, say, map significantly better than by doing a seq and an into separately, and considering how both seq and into make useful primitives in their own right, Clojure makes the choice of exposing the useful primitives and letting users compose them. This lets us map and into to produce a set from a set, while leaving us the freedom to not go on to the into stage when there is no value to be gained by producing a set (or another collection type, as the case may be).
Not all is seq; or, consider reducers
Some of the problems with using the collection types themselves when mapping, filtering etc. don't apply when using reducers.
The key difference between reducers and seqs is that the intermediate objects produced by clojure.core.reducers/map and friends only produce "descriptor" objects that maintain information on what computations need to be performed in the event that the reducer is actually reduced. Thus, individual stages of the computation can be merged.
This allows us to do things like
(require '[clojure.core.reducers :as r])
(->> some-set (r/map f) (r/filter p?) (into #{}))
Of course we still need to be explicit about our (into #{}), but this is just a way of saying "the reducers pipeline ends here; please produce the result in the form of a set". We could also ask for a different collection type (a vector of results perhaps; note that mapping f over a set may well produce duplicate results and we may in some situations wish to preserve them) or a scalar value ((reduce + 0)).
Summary
The main points are these:
the fastest way to iterate over a collection typically doesn't involve produce intermediate results similar to the input;
seq uses the fastest way to iterate;
the best approach to transforming a set by mapping or filtering involves using a seq-style operation, because we want to iterate very fast while accumulating an output;
thus seq makes a great primitive;
map and filter, in their choice to deal with seqs, depending on the scenario, may avoid performance penalties without upsides, benefit from laziness etc., yet can still be used to produce a collection result with into;
thus they too make great primitives.
Some of these points may not apply to a statically typed language, but of course Clojure is dynamic. Additionally, when we do want to a return that matches input type, we're simply forced to be explicit about it and that, in itself, may be viewed as a good thing.
Sequences are a logical list abstraction. They provide access to a (stable) ordered sequence of values. They are implemented as views over collections (except for lists where the concrete interface matches the logical interface). The sequence (view) is a separate data structure that refers into the collection to provide the logical abstraction.
Sequence functions (map, filter, etc) take a "seqable" thing (something which can produce a sequence), call seq on it to produce the sequence, and then operate on that sequence, returning a new sequence. It is up to you whether you need to or how to re-collect that sequence back into a concrete collection. While vectors and lists are ordered, sets and maps are not and thus sequences over these data structures must compute and retain the order outside the collection.
Specialized functions like mapv, filterv, reduce-kv allow you to stay "in the collection" when you know you want the operation to return a collection at the end instead of sequence.
Seqs are ordered structures, whereas maps and sets are unordered. Two maps that are equal in value may have a different internal ordering. For example:
user=> (seq (array-map :a 1 :b 2))
([:a 1] [:b 2])
user=> (seq (array-map :b 2 :a 1))
([:b 2] [:a 1])
It makes no sense to ask for the rest of a map, because it's not a sequential structure. The same goes for a set.
So what about vectors? They're sequentially ordered, so we could potentially map across a vector, and indeed there is such a function: mapv.
You may well ask: why is this not implicit? If I pass a vector to map, why doesn't it return a vector?
Well, first that would mean making an exception for ordered structures like vectors, and Clojure isn't big on making exceptions.
But more importantly you'd lose one of the most useful properties of seqs: laziness. Chaining together seq functions, such as map and filter is a very common operation, and without laziness this would be much less performant and far more memory-intensive.
The collection classes follow a factory pattern i.e instead of implementing ISeq they implement Sequable i.e you can create a ISeq from the collection but the collection itself is not an ISeq.
Now even if these collections implemented ISeq directly I am not sure how that would solve your problem of having general purpose sequence functions that would return the original object, as that would not make sense at all as these general purpose functions are supposed to work on ISeq, they have no idea about which object gave them this ISeq
Example in java:
interface ISeq {
....
}
class A implements ISeq {
}
class B implements ISeq {
}
static class Helpers {
/*
Filter can only work with ISeq, that's what makes it general purpose.
There is no way it could return A or B objects.
*/
public static ISeq filter(ISeq coll, ...) { }
...
}

What is the difference between clojure's APersistentMap implementations

I'm trying to figure out what the difference is between a PersistentHashMap, PersistentArrayMap, PersistentTreeMap, and PersistentStructMap.
Also if I use {:a 1} it gives me a PersistentArrayMap but can this change to any of the other ones if I give it objects or things other than keys?
The four implementations you list fall into three groups:
"literal": PersistentArrayMap and PersistentHashMap: basic map types used when dealing with map literals (though constructor functions are also available with different behaviour around handling duplicate keys -- in Clojure 1.5.x literals throw exceptions when they discover duplicate keys, constructor functions work like left-to-right repeated conjing; this behaviour has been evolving from version to version). Array maps get promoted to hash maps when growing beyond a certain number of entries (9 IIRC). Array maps exist because they are faster for small maps; they also differ from hash maps in that they keep entries in insertion order prior to promotion to hash map (you can use clojure.core/array-map to get arbitrarily large array maps, which may be useful if you really know you'd benefit from insertion-order traversals and the map won't be too large, perhaps just a bit over the usual threshold; NB. a subsequent assoc to such an oversized array map will return a hash map). Array maps use arrays with keys and values interleaved; the PHM uses a persistent version of Phil Bagwell's hash array mapped trie with separate chaining for hash collisions and separate node types for mostly-empty and at-least-half-full nodes and is easily the most complex data structure in Clojure.
sorted: PersistentTreeMap instances are created by special request only (a call to sorted-map or sorted-map-by). They are implemented as red-black trees and maintain entries in a particular order, as specified by the default compare comparator if created with sorted-map or a user-supplied comparator if created with sorted-map-by.
special-purpose, probably deprecated: PersistentStructMap is not used very often and mostly viewed as deprecated in favour of records, although I actually can't remember right now if there ever was an official deprecation notice. The original purpose was to provide maps with particularly fast access to certain often-used keys. This can now be accomplished with records when using keywords for field access (with the keyword in the operator position: (:foo instance-of-some-record-with-field-foo)), though it's important to note that records are not = to regular maps with the same entries.
All these four built-in map types fall into the same "equality partition", that is, any two maps of one of the four classes mentioned above will be equal if (and only if) they contain the same keys (as determined by Clojure's =) with the same corresponding values. Records, as mentioned in 3. above, are map-like, but each record type forms its own equality partition.
They are different implementation of a Persistent Map (they all extend APersistentMap). So a PersistentArrayMap uses an array as the underlying data structure to implement persistent map and similarly other implementations uses different underlying data stucture.
The reason for different implementation is they provide different benefits in different situations (as the efficiency of the implementation depends on the underlying data structure).
From a developer perspective, it is abstracted away and hence you should not be directly using
them and instead work with the APersistentMap abstract class or IPersistentMap interface (in case some type checking etc is required for some specific case).
Depending on the number of elements in the map the various implementations are used.
(type (into {} (map #(-> [% %]) (range 5))))
=> PersistentArrayMap
(type (into {} (map #(-> [% %]) (range 10))))
=> PersistentHashMap

Clojure Lazy Sequences that are Vectors

I have noticed that lazy sequences in Clojure seem to be represented internally as linked lists (Or at least they are being treated as a sequence with only sequential access to elements). Even after being cached into memory, access time over the lazy-seq with nth is O(n), not constant time as with vectors.
;; ...created my-lazy-seq here and used the first 50,000 items
(time (nth my-lazy-seq 10000))
"Elapsed time: 1.081325 msecs"
(time (nth my-lazy-seq 20000))
"Elapsed time: 2.554563 msecs"
How do I get a constant-time lookups or create a lazy vector incrementally in Clojure?
Imagine that during generation of the lazy vector, each element is a function of all elements previous to it, so the time spent traversing the list becomes a significant factor.
Related questions only turned up this incomplete Java snippet:
Designing a lazy vector: problem with const
Yes, sequences in Clojure are described as "logical lists" with three operations (first, next and cons).
A sequence is essentially the Clojure version of an iterator (although clojure.org insists that sequences are not iterators, since they don't hold iternal state), and can only move through the backing collection in a linear front-to-end fashion.
Lazy vectors do not exist, at least not in Clojure.
If you want constant time lookups over a range of indexes, without calculating intermediate elements you don't need, you can use a function that calculates the result on the fly. Combined with memoization (or caching the results in an arg-to-result hash on your own) you get pretty much the same effect as I assume you want from the lazy vector.
This obviously only works when there are algorithms that can compute f(n) more directly than going through all preceding f(0)...f(n-1). If there is no such algorithm, when the result for every element depends on the result for every previous element, you can't do better than the sequence iterator in any case.
Edit
BTW, if all you want is for the result to be a vector so you get quick lookups afterwards, and you don't mind that elements are created sequentially the first time, that's simple enough.
Here is a Fibonacci implementation using a vector:
(defn vector-fib [v]
(let [a (v (- (count v) 2)) ; next-to-last element
b (peek v)] ; last element
(conj v (+ a b))))
(def fib (iterate vector-fib [1 1]))
(first (drop 10 fib))
=> [1 1 2 3 5 8 13 21 34 55 89 144]
Here we are using a lazy sequence to postpone the function calls until asked for (iterate returns a lazy sequence), but the results are collected and returned in a vector.
The vector grows as needed, we add only the elements up to the last one asked for, and once computed it's a constant time lookup.
Was it something like this you had in mind?