What is the complexity of calculating frequencies of elements in a collection? - clojure

Here is the implementation of frequencies in clojure:
(defn frequencies
  "Returns a map from distinct items in coll to the number of times
  they appear."
  [coll]
  (persistent!
   (reduce (fn [counts x]
             (assoc! counts x (inc (get counts x 0))))
           (transient {}) coll)))
Is assoc! considered a mutation or not?
What is the complexity of assoc! inside frequencies?
Also it seems that counts is accessed twice in each iteration: does it cause a performance penalty?

assoc! is a mutation of a transient; it is O(log n) amortised, I believe. Hence the whole execution of frequencies is O(n log n).
counts is a locally bound variable, so accessing it twice is no problem.
Here is a functional version of frequencies that doesn't use any mutable state:
(defn frequencies-2 [coll]
  (reduce (fn [m v] (assoc m v (inc (get m v 0)))) {} coll))
This functional version is also O(n log n), though it will have somewhat more overhead (a higher constant factor) due to creating and discarding more temporary objects.
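As a quick sanity check, both versions produce the same result on a small collection (output shown as I would expect it at the REPL):
(frequencies-2 [:a :b :a :c :a])
;; => {:a 3, :b 1, :c 1}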

You could use a tree to store the map from elements to frequencies, with O(log n) complexity per operation (it can be a binary search tree, an AVL tree, a red-black tree, etc.).
Choose a functional implementation of this tree, i.e. you can't mutate it; instead (assoc counts x freq) returns a new data structure that shares the common parts with counts in memory. It's a kind of "copy on write".
Then the performance of computing all frequencies would be O(n log(n)).
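As a minimal sketch of this idea, Clojure's built-in sorted-map is already a persistent red-black tree, so a frequencies variant with exactly this O(n log n) behaviour could look like the following (an illustration, not the core implementation):
(defn frequencies-sorted [coll]
  ;; sorted-map is a persistent red-black tree: assoc returns a new
  ;; tree sharing structure with the old one, each update is O(log n)
  (reduce (fn [counts x]
            (assoc counts x (inc (get counts x 0))))
          (sorted-map)
          coll))

(frequencies-sorted "abracadabra")
;; => {\a 5, \b 2, \c 1, \d 1, \r 2}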

assoc! mutates a transient data structure and has much better performance than assoc. It is not really a violation of Clojure's immutability model (see http://clojure.org/transients).
persistent! and transient are O(1).
assoc! is O(log32 n), which is effectively O(1): a hash-map holds at most ~2^32 items, which leaves a maximum tree depth of about 6.
Therefore, the complexity of frequencies is linear on the size of coll.
Remark: as noted by @mikera, the complexity of frequencies would also be linear with assoc, but with a higher constant factor.

Loops & state management in test.check

With the introduction of Spec, I try to write test.check generators for all of my functions. This is fine for simple data structures, but tends to become difficult with data structures that have parts that depend on each other. In other words, some state management within the generators is then required.
It would already help enormously to have generator-equivalents of Clojure's loop/recur or reduce, so that a value produced in one iteration can be stored in some aggregated value that is then accessible in a subsequent iteration.
One simple example where this would be required is writing a generator for splitting up a collection into exactly X partitions, with each partition having between zero and Y elements, and where the elements are then randomly assigned to any of the partitions. (Note that test.chuck's partition function does not allow you to specify X or Y.)
If you write this generator by looping through the collection, then this would require access to the partitions filled up during previous iterations, to avoid exceeding Y.
Does anybody have any ideas? Partial solutions I have found:
test.check's let and bind allow you to generate a value and then reuse that value later on, but they do not allow iterations.
You can iterate through a collection of previously generated values with a combination of the tuple and bind functions, but these iterations do not have access to the values generated during previous iterations.
(defn bind-each [k coll]
  (apply tcg/tuple (map (fn [x] (tcg/bind (tcg/return x) k)) coll)))
You can use atoms (or volatiles) to store and access values generated during previous iterations. This works, but is very un-Clojure-like, in particular because you need to reset! the atom/volatile before the generator is returned, to avoid its contents being reused in the next call of the generator.
Generators are monad-like due to their bind and return functions, which hints at the use of a monad library such as Cats in combination with a State monad. However, the State monad was removed in Cats 2.0 (because it was allegedly not a good fit for Clojure), while other support libraries I am aware of do not have formal Clojurescript support. Furthermore, when implementing a State monad in his own library, Jim Duey — one of Clojure's monad experts — seems to warn that the use of the State monad is not compatible with test.check's shrinking (see the bottom of http://www.clojure.net/2015/09/11/Extending-Generative-Testing/), which significantly reduces the merits of using test.check.
You can accomplish the iteration you're describing by combining gen/let (or equivalently gen/bind) with explicit recursion:
(defn make-foo-generator
  [state]
  (if (good-enough? state)
    (gen/return state)
    (gen/let [state' (gen-next-step state)]
      (make-foo-generator state'))))
However, it's worth trying to avoid this pattern if possible, because each use of let/bind undermines the shrinking process. Sometimes it's possible to reorganize the generator using gen/fmap. For example, to partition a collection into a sequence of X subsets (which I realize is not exactly what your example was, but I think it could be tweaked to fit), you could do something like this:
(defn partition
  [coll subset-count]
  (gen/let [idxs (gen/vector (gen/choose 0 (dec subset-count))
                             (count coll))]
    (->> (map vector coll idxs)
         (group-by second)
         (sort-by key)
         (map (fn [[_ pairs]] (map first pairs))))))

Find maximum element in Clojure sorted-set

I am looking for an idiomatic, O(1), way to extract the maximum element of a sorted-set of integers, in Clojure. Do I have to cast the set into a seq?
With Clojure's built-in sorted sets, this is O(log n):
(first (rseq the-set))
With data.avl sets, the above also works, and you can also use nth to access elements by rank in O(log n) time:
(nth the-set (dec (count the-set)))
None of these data structures offers O(1) access to extreme elements, but you could certainly use [the-set max-element] bundles (possibly represented as records for cleaner manipulation) with a custom API if you expect to perform frequent repeated accesses to the maximum element.
If you only care about fast access to the maximum element and not about ordered traversals, you could actually use the bundle approach with "unsorted" sets – either built-in hash sets or data.int-map sets. That's not useful for repeated disj of maximum elements, of course, only for maintaining upper bound information for a fundamentally unsorted set (unless you keep a stack of max elements in the bundle, at which point it definitely makes sense to simply use sorted sets and get all relevant operations for free).
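A minimal sketch of the bundle idea (the record and helper names here are made up for illustration):
;; hypothetical bundle: keep the current maximum alongside the set itself
(defrecord MaxBundle [items max-element])

(def empty-bundle (->MaxBundle #{} nil))

(defn add [bundle x]
  (->MaxBundle (conj (:items bundle) x)
               (if (or (nil? (:max-element bundle))
                       (> x (:max-element bundle)))
                 x
                 (:max-element bundle))))

(:max-element (reduce add empty-bundle [3 1 7 5]))
;; => 7  (constant-time access once maintained)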
If you take into account the construction of the set, (into #{} my-coll) followed by (apply max my-set) is faster than (into (sorted-set) my-coll) followed by (last my-sorted-set).
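If you want to check such a claim yourself, a rough benchmark along these lines is the usual approach (this assumes the criterium library is on the classpath; the numbers will of course depend on your data):
(require '[criterium.core :as crit])

(def my-coll (shuffle (range 100000)))

;; unsorted hash set, then a linear scan for the maximum
(crit/quick-bench (apply max (into #{} my-coll)))

;; sorted set, then take the last (largest) element
(crit/quick-bench (last (into (sorted-set) my-coll)))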

Filter, then map? Or just use a for loop?

I keep running into situations where I need to filter a collection of maps by some function, and then pull out one value from each of the resulting maps to make my final collection.
I often use this basic structure:
(map :key (filter some-predicate coll))
It occurred to me that this basically accomplishes the same thing as a for loop:
(for [x coll :when (some-predicate x)] (:key x))
Is one way more efficient than the other? I would think the for version would be more efficient since we only go through the collection once. Is this accurate?
Neither is significantly more efficient than the other.
Both of these return an unrealized lazy sequence where each item is computed as it is read. The first one does not traverse the list twice; instead it creates one lazy sequence that produces the items matching the filter, and that sequence is then immediately consumed (still lazily) by the map function. So in the first case you have one lazy sequence lazily consuming items from another lazy sequence. The call to for, on the other hand, produces a single lazy-seq with a lot of logic in each step.
You can see the code that the for example expands into with:
(pprint (macroexpand-1 '(for [x coll :when (some-predicate x)] (:key x))))
On the whole the performance will be very similar, with the second method perhaps producing slightly less garbage, so the only way to decide between them on the basis of performance is to benchmark. On the basis of style, I choose the first one because it is shorter, though I might write it with the thread-last macro if there were more stages:
(->> coll
     (filter some-predicate)
     (take some-limit)
     (map :key))
Though this basically comes down to personal style.
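For completeness, if the intermediate lazy sequence ever does show up in profiling, a single-pass eager variant using transducers is another option (an alternative technique, not something the answer above requires):
(into [] (comp (filter some-predicate)
               (map :key))
      coll)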

recursive function for finding dependencies

I have a collection of maps. Given any map, I want to find all the maps it depends on. Any map will have immediate dependencies. Each one of the immediate dependencies will in turn have their own dependencies, and so and so forth.
I am having trouble writing a recursive function. The code below gives a stackoverflow error. (The algorithm is not efficient anyway - I would appreciate help in cleaning it).
Below is my implementation -
find-deps takes a map and returns a coll of maps: the immediate dependencies of the map.
(find-deps [m]) => coll
The function below -
Checks if direct-deps, the immediate deps, is empty, as the base condition.
If not, it maps find-deps on all the immediate deps. This is the step
which is causing the problem.
Usually in recursive functions we are able to narrow down the initial input, but here my input keeps on increasing!
(defn find-all-deps
  ([m]
   (find-all-deps m []))
  ([m all-deps]
   (let [direct-deps (find-deps m)]
     (if-not (seq direct-deps)
       all-deps
       (map #(find-all-deps % (concat all-deps %)) direct-deps)))))
When working with directed graphs, it's often useful to ensure that you don't visit a node in the graph twice. This problem looks a lot like many graph traversal problems out there, and your solution is very close to a normal depth-first traversal; you just need to not follow cycles (or not allow them in the input). One way to start would be to check that a dependency has not already been visited before you follow it again. Unfortunately map is poorly suited to this, because each element in the list you are mapping over can't know about the dependencies found in the other elements of the same list. reduce is a better fit here because you are looking for a single accumulated answer.
...
(if-not (seq direct-deps)
  all-deps
  (reduce (fn [result this-dep]
            (if-not (contains? result this-dep)
              (find-all-deps this-dep (conj result this-dep))
              result))
          #{}              ;; use a set to prevent duplicates
          direct-deps))
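Putting it together, a self-contained sketch of this depth-first approach might look like the following, where deps-of is a hypothetical stand-in for your find-deps and the dependency data is made up:
;; hypothetical dependency data: each item maps to its direct deps
(def deps-of
  {:a [:b :c]
   :b [:d]
   :c [:d]
   :d []})

(defn find-all-deps
  ([m] (find-all-deps m #{}))
  ([m seen]
   (reduce (fn [result dep]
             (if (contains? result dep)
               result
               (find-all-deps dep (conj result dep))))
           seen
           (deps-of m))))

(find-all-deps :a)
;; => #{:b :c :d}   ; each dependency is visited at most once, so cycles don't blow the stack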

Clojure Lazy Sequences that are Vectors

I have noticed that lazy sequences in Clojure seem to be represented internally as linked lists (Or at least they are being treated as a sequence with only sequential access to elements). Even after being cached into memory, access time over the lazy-seq with nth is O(n), not constant time as with vectors.
;; ...created my-lazy-seq here and used the first 50,000 items
(time (nth my-lazy-seq 10000))
"Elapsed time: 1.081325 msecs"
(time (nth my-lazy-seq 20000))
"Elapsed time: 2.554563 msecs"
How do I get constant-time lookups, or create a lazy vector incrementally, in Clojure?
Imagine that during generation of the lazy vector, each element is a function of all elements previous to it, so the time spent traversing the list becomes a significant factor.
Related questions only turned up this incomplete Java snippet:
Designing a lazy vector: problem with const
Yes, sequences in Clojure are described as "logical lists" with three operations (first, next and cons).
A sequence is essentially the Clojure version of an iterator (although clojure.org insists that sequences are not iterators, since they don't hold internal state), and can only move through the backing collection in a linear front-to-end fashion.
Lazy vectors do not exist, at least not in Clojure.
If you want constant time lookups over a range of indexes, without calculating intermediate elements you don't need, you can use a function that calculates the result on the fly. Combined with memoization (or caching the results in an arg-to-result hash on your own) you get pretty much the same effect as I assume you want from the lazy vector.
This obviously only works when there are algorithms that can compute f(n) more directly than going through all preceding f(0)...f(n-1). If there is no such algorithm, when the result for every element depends on the result for every previous element, you can't do better than the sequence iterator in any case.
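A tiny sketch of that idea, with a made-up element function f and Clojure's built-in memoize:
;; hypothetical: f computes element n directly, without visiting 0..n-1
(def f
  (memoize (fn [n] (* n n))))

(f 10000) ; computed on the first call
(f 10000) ; subsequent calls hit the cache, effectively constant time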
Edit
BTW, if all you want is for the result to be a vector so you get quick lookups afterwards, and you don't mind that elements are created sequentially the first time, that's simple enough.
Here is a Fibonacci implementation using a vector:
(defn vector-fib [v]
  (let [a (v (- (count v) 2)) ; next-to-last element
        b (peek v)]           ; last element
    (conj v (+ a b))))
(def fib (iterate vector-fib [1 1]))
(first (drop 10 fib))
=> [1 1 2 3 5 8 13 21 34 55 89 144]
Here we are using a lazy sequence to postpone the function calls until asked for (iterate returns a lazy sequence), but the results are collected and returned in a vector.
The vector grows as needed, we add only the elements up to the last one asked for, and once computed it's a constant time lookup.
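For example (a small illustration; for large indexes you would want +' instead of + to avoid long overflow):
(let [v (first (drop 50 fib))]
  (nth v 40))  ; nth on a vector is effectively constant time
;; => 165580141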
Was it something like this you had in mind?