Good Clojure representation for unordered pairs?

I'm creating unordered pairs of data elements. A comment by @Chouser on this question says that hash-sets are implemented with 32 children per node, while sorted-sets are implemented with 2 children per node. Does this mean that my pairs will take up less space if I implement them with sorted-sets rather than hash-sets (assuming that the data elements are Comparable, i.e. can be sorted)? (I doubt it matters for me in practice. I'll only have hundreds of these pairs, and lookup in a two-element data structure, even sequential lookup in a vector or list, should be fast. But I'm curious.)

Comparing an explicit check of the first two elements of a list against Clojure's built-in sets, I don't see a significant difference when running each ten million times:
user> (defn my-lookup [key pair]
        (condp = key
          (first pair) true
          (second pair) true
          false))
#'user/my-lookup

user> (time (let [data `(1 2)]
              (dotimes [x 10000000] (my-lookup (rand-nth [1 2]) data))))
"Elapsed time: 906.408176 msecs"
nil

user> (time (let [data #{1 2}]
              (dotimes [x 10000000] (contains? data (rand-nth [1 2])))))
"Elapsed time: 1125.992105 msecs"
nil
Of course, micro-benchmarks such as this are inherently flawed and difficult to do well, so don't try to use this to show that one is better than the other. I only intend to demonstrate that the two are very similar.
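For less noisy numbers, criterium is the usual tool. A minimal sketch (criterium is a third-party dependency, e.g. [criterium "0.4.6"]; version from memory):

user> (require '[criterium.core :refer [quick-bench]])
nil
user> (quick-bench (contains? #{1 2} (rand-nth [1 2])))
;; prints mean execution time with statistical bounds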

If I'm doing something with unordered pairs, I usually like to use a map since that makes it easy to look up the other element. E.g., if my pair is [2 7], then I'll use {2 7, 7 2}, and I can do ({2 7, 7 2} 2), which gives me 7.
As for space, the PersistentArrayMap implementation is actually very space conscious. If you look at the source code (see previous link), you'll see that it allocates an Object[] of the exact size needed to hold all the key/value pairs. I think this is used as the default map type for all maps with no more than 8 key/value pairs.
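If you want to see that cutoff yourself, here is a quick REPL check (the exact threshold is an implementation detail and could change between Clojure versions):

user=> (type (array-map :a 1 :b 2 :c 3 :d 4 :e 5 :f 6 :g 7 :h 8))
clojure.lang.PersistentArrayMap
user=> ;; assoc'ing a 9th pair promotes the map to a hash-map
user=> (type (assoc (array-map :a 1 :b 2 :c 3 :d 4 :e 5 :f 6 :g 7 :h 8) :i 9))
clojure.lang.PersistentHashMap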
The only catch here is that you need to be careful about duplicate keys. {2 2, 2 2} will cause an exception. You could get around this problem by doing something like this: (merge {2 2} {2 2}), i.e. (merge {a b} {b a}) where it's possible that a and b have the same value.
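Putting that together, a small sketch of a constructor that survives a equal to b (the unordered-pair name is just for illustration):

(defn unordered-pair
  "Two-way lookup map for an unordered pair. merge tolerates the
  key collision that the literal {a b, b a} would throw on when
  a and b are equal."
  [a b]
  (merge {a b} {b a}))

(unordered-pair 2 7)     ;=> {2 7, 7 2}
(unordered-pair 2 2)     ;=> {2 2}
((unordered-pair 2 7) 7) ;=> 2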
Here's a little snippet from my repl:
user=> (def a (array-map 1 2 3 4))
#'user/a
user=> (type a)
clojure.lang.PersistentArrayMap
user=> (.count a) ; count simply returns array.length/2 of the internal Object[]
2
Note that I called array-map explicitly above. This is related to a question I asked a while ago related to map literals and def in the repl: Why does binding affect the type of my map?

This should be a comment, but I'm too low on reputation and too eager to share information.
If you are concerned about performance, clj-tuple by Zachary Tellman may be 2-3 times faster than ordinary lists/vectors, as claimed here: ztellman/clj-tuple.
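Usage is roughly as follows (a sketch from memory; check the repo for current coordinates):

;; project.clj dependency, version possibly outdated: [clj-tuple "0.2.2"]
(require '[clj-tuple :refer [tuple]])

(def pair (tuple 1 2)) ; compact fixed-arity collection
(first pair)           ;=> 1
(nth pair 1)           ;=> 2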

I wasn't planning to benchmark different pair representations now, but @ArthurUlfeldt's and @DaoWen's answers led me to do so. Here are my results using criterium's bench macro; source code is below. To summarize: as expected, there are no large differences between the seven representations I tested. However, there is a gap between the times for the fastest two, array-map and hash-map, and the rest. This is consistent with DaoWen's and Arthur Ulfeldt's remarks.
Average execution time in seconds, in order from fastest to slowest (MacBook Pro, 2.3GHz Intel Core i7):
array-map: 5.602099
hash-map: 5.787275
vector: 6.605547
sorted-set: 6.657676
hash-set: 6.746504
list: 6.948222
Edit: I added a run of test-control below, which does only what is common to all of the other tests. test-control took, on average, 5.571284 seconds. It appears that there is a bigger difference between the -map representations and the others than I had thought: access to a hash-map or an array-map of two entries is essentially instantaneous (on my computer, OS, Java, etc.), whereas the other representations take about an extra second over 10 million iterations. Spread across 10M iterations, that still means those operations are almost instantaneous. (My guess is that test-arraymap coming out faster than test-control is noise from other things happening in the background on the computer, or perhaps an idiosyncrasy of compilation.)
(A caveat: I forgot to mention that I'm getting a warning from criterium: "JVM argument TieredStopAtLevel=1 is active, and may lead to unexpected results as JIT C2 compiler may not be active." I believe this means that Leiningen is starting Java with a command line option that is geared toward the -server JIT compiler, but is being run instead with the default -client JIT compiler. So the warning is saying "you think you're running -server, but you're not, so don't expect -server behavior." Running with -server might change the times given above.)
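(If you hit the same warning, the fix I know of is to clear Leiningen's default JVM options in project.clj so benchmarks run under the full C2 JIT; defaults vary by Leiningen version, so treat this as a sketch:)

;; project.clj: override Leiningen's default -XX:TieredStopAtLevel=1
:profiles {:bench {:jvm-opts ^:replace []}}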
(use 'criterium.core)

;; based on Arthur Ulfeldt's answer:
(defn pairlist-contains? [key pair]
  (condp = key
    (first pair) true
    (second pair) true
    false))

(defn pairvec-contains? [key pair]
  (condp = key
    (pair 0) true
    (pair 1) true
    false))
(def ntimes 10000000)

;; Test how long it takes to do what's common to all of the other tests
(defn test-control []
  (print "=============================\ntest-control:\n")
  (bench
   (dotimes [_ ntimes]
     (def _ (rand-nth [:a :b])))))

(defn test-list []
  (let [data '(:a :b)]
    (print "=============================\ntest-list:\n")
    (bench
     (dotimes [_ ntimes]
       (def _ (pairlist-contains? (rand-nth [:a :b]) data))))))

(defn test-vec []
  (let [data [:a :b]]
    (print "=============================\ntest-vec:\n")
    (bench
     (dotimes [_ ntimes]
       (def _ (pairvec-contains? (rand-nth [:a :b]) data))))))

(defn test-hashset []
  (let [data (hash-set :a :b)]
    (print "=============================\ntest-hashset:\n")
    (bench
     (dotimes [_ ntimes]
       (def _ (contains? data (rand-nth [:a :b])))))))

(defn test-sortedset []
  (let [data (sorted-set :a :b)]
    (print "=============================\ntest-sortedset:\n")
    (bench
     (dotimes [_ ntimes]
       (def _ (contains? data (rand-nth [:a :b])))))))

(defn test-hashmap []
  (let [data (hash-map :a :a :b :b)]
    (print "=============================\ntest-hashmap:\n")
    (bench
     (dotimes [_ ntimes]
       (def _ (contains? data (rand-nth [:a :b])))))))

(defn test-arraymap []
  (let [data (array-map :a :a :b :b)]
    (print "=============================\ntest-arraymap:\n")
    (bench
     (dotimes [_ ntimes]
       (def _ (contains? data (rand-nth [:a :b])))))))

(defn test-all []
  (test-control)
  (test-list)
  (test-vec)
  (test-hashset)
  (test-sortedset)
  (test-hashmap)
  (test-arraymap))

Related

Combine transduction output with input into a hashmap

I want to do the following in Clojure as idiomatically as possible:
transduce a collection
associate each element of the input collection with the corresponding element in the output collection
return the result in a hashmap
Is there a succinct way to do this using core library functions?
If not, what improvements can you suggest to the following implementation?
(defn to-hash [coll xform]
  (reduce
   merge
   (map
    #(apply hash-map %)
    (mapcat hash-map coll (into [] xform coll)))))
something like this should do the trick without intermediate collections:
(defn process [data xform]
  (zipmap data (eduction xform data)))

user> (process [1 2 3] (comp (map inc) (map #(* % %))))
;;=> {1 4, 2 9, 3 16}
the docs on eduction say the following:
Returns a reducible/iterable application of the transducers
to the items in coll. Transducers are applied in order as if
combined with comp. Note that these applications will be
performed every time reduce/iterator is called.
so no additional collection is created.
This is only good, of course, as long as there is a one-to-one relationship between input and output elements. What is the desired output for (process [1 -2 3] (filter pos?)) or (process [1 1 1 2 2 2] (dedupe))?
(by the way, your to-hash implementation has the same flaw)
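To make the flaw concrete: with the zipmap version, a shorter output is silently mispaired with the input (my own REPL check, not from the original answer):

user> (process [1 -2 3] (filter pos?))
;;=> {1 1, -2 3}   ; -2 is wrongly paired with 3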
A transducer is a function that takes a reducing function and returns a new reducing function. To make it work with transducers where there is not a one-to-one mapping from elements in the input collection to the output, you will have to use your transducer to create a new reducing function (step2 in the code below) that will associate elements into your hash map. Something like this.
(def ^:dynamic assoc-k nil)

(defn assoc-step [dst x]
  (assoc dst assoc-k x))

(defn to-hash [coll xform]
  (let [step  (xform (completing assoc-step))
        step2 (fn [dst x] (binding [assoc-k x] (step dst x)))]
    (reduce step2 {} coll)))
This implementation is quite basic, and I am not sure to what extent it will work with stateful transducers. But it will work with the stateless ones, such as map and filter.
And we can test it with a transducer that keeps odd elements in the input collection and squares them:
(defn square [x] (* x x))
(to-hash (range 10) (comp (filter odd?) (map square)))
;; => {1 1, 3 9, 5 25, 7 49, 9 81}

Clojure - Difference Between Map and Reduce // Converting One to the Other

(defn DoubleFrequency []
  (def s (slurp "Example.txt"))
  (def m (reduce #(assoc %1 %2 (inc (%1 %2 0)))
                 {}
                 (re-seq #".." s)))
  (def c (count m))
  (doseq [[k x] m]
    (println k ":" (/ x c))))
I'm trying to apply concurrency to my program, and I want to use pmap, but I'm not sure how to work it into my current code here. The functionality is correct on a single core, but ideally I want to replace reduce with pmap in some way and achieve the same results.
first of all, the function you're trying to write already exists: it's called frequencies:
user> (frequencies [1 2 1 3 1 4 4])
;;=> {1 3, 2 1, 3 1, 4 2}
it is, indeed, single threaded. So let's try to make it parallel.
the initial approach with reduce is in the right direction, though it's not parallel either. it can be turned into a parallel version with clojure's standard library concurrency facilities, namely reducers.
first of all, let's rewrite your reducer function a bit, to do the same thing, but in a more idiomatic way (it is optional, but good for readability):
#(assoc %1 %2 (inc (%1 %2 0))) => #(update %1 %2 (fnil inc 0))
then we can approach to the parallel reduce with fold:
(require '[clojure.core.reducers :as r])

(defn pfreq [data]
  (r/fold
   (partial merge-with +)
   (fn [acc k] (update acc k (fnil inc 0)))
   data))
the idea is that it splits your collection into chunks (if it is long enough), and then combines the chunks' results with merge-with:
user> (pfreq [1 2 1 3 1 4 1 5 2])
;;=> {1 4, 2 2, 3 1, 4 1, 5 1}
notice also that the collection should be 'foldable'. By default, persistent vectors and maps are foldable; a re-seq result is not, so you should first convert it into a vector: (vec (re-seq #".." s)). otherwise you won't get any parallelization, falling back to plain reduce.
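in context, that conversion is just one extra call (hypothetical wrapper name):

(defn pfreq-text [s]
  ;; vec realizes the re-seq result into a foldable vector
  (pfreq (vec (re-seq #".." s))))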
You can obviously approach this one with pmap as well, with the same strategy: split -> map -> combine:
(defn pfreq2 [chunk-size data]
  (->> data
       (partition-all chunk-size)
       (pmap frequencies)
       (apply merge-with +)))
but this is not as flexible and powerful as the reducers pipeline.

sort-by on a lazy sequence of hash-maps in clojure

I need to take 20 results from a lazy sequence of millions of hash-maps, where the 20 are chosen by sorting on various values within the hash-maps.
For example:
(def population [{:id 85187153851 :name "anna" :created #inst "2012-10-23T20:36:25.626-00:00" :rank 77336}
                 {:id 12595145186 :name "bob" :created #inst "2011-02-03T20:36:25.626-00:00" :rank 983666}
                 {:id 98751563911 :name "cartmen" :created #inst "2007-01-13T20:36:25.626-00:00" :rank 112311}
                 ...
                 {:id 91514417715 :name "zaphod" :created #inst "2015-02-03T20:36:25.626-00:00" :rank 9866}])
In normal circumstances a simple sort-by would get the job done:
(sort-by :id population)
(sort-by :name population)
(sort-by :created population)
(sort-by :rank population)
But I need to do this across millions of records as fast as possible and want to do it lazily rather than having to realize the entire data set.
I looked around a lot and found a number of implementations of algorithms that work really well for sorting a sequence of values (mostly numeric) but none for a lazy sequence of hash-maps in the way I need.
Speed & efficiency being of prime importance, the best I have found has been the quicksort example from the Joy Of Clojure book (Chapter 6.4) which does just enough work to return the required result.
(ns joy.q)

(defn sort-parts
  "Lazy, tail-recursive, incremental quicksort. Works against
  and creates partitions based on the pivot, defined as 'work'."
  [work]
  (lazy-seq
   (loop [[part & parts] work]
     (if-let [[pivot & xs] (seq part)]
       (let [smaller? #(< % pivot)]
         (recur (list*
                 (filter smaller? xs)
                 pivot
                 (remove smaller? xs)
                 parts)))
       (when-let [[x & parts] parts]
         (cons x (sort-parts parts)))))))

(defn qsort [xs]
  (sort-parts (list xs)))
Works really well...
(time (take 10 (qsort (shuffle (range 10000000)))))
"Elapsed time: 551.714003 msecs"
(0 1 2 3 4 5 6 7 8 9)
Great! But...
However much I try, I can't seem to work out how to apply this to a sequence of hash-maps.
I need something like:
(take 20 (qsort-by :created population))
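One way to get there is to thread a key function through the JoC quicksort; a sketch (the sort-parts-by and qsort-by names are mine, and compare is used so non-numeric keys such as dates work):

(defn sort-parts-by
  "Like sort-parts, but partitions by (keyfn x) instead of x."
  [keyfn work]
  (lazy-seq
   (loop [[part & parts] work]
     (if-let [[pivot & xs] (seq part)]
       (let [pv (keyfn pivot)
             smaller? #(neg? (compare (keyfn %) pv))]
         (recur (list*
                 (filter smaller? xs)
                 pivot
                 (remove smaller? xs)
                 parts)))
       (when-let [[x & parts] parts]
         (cons x (sort-parts-by keyfn parts)))))))

(defn qsort-by [keyfn xs]
  (sort-parts-by keyfn (list xs)))

;; (take 20 (qsort-by :created population))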
If you only need the top N elements, a full sort is too expensive (even a lazy sort like the one in the JoC: it needs to keep nearly the whole data set in memory).
You only need to scan (reduce) the dataset and keep the best N items so far.
=> (defn top-by [n k coll]
     (reduce
      (fn [top x]
        (let [top (conj top x)]
          (if (> (count top) n)
            (disj top (first top))
            top)))
      (sorted-set-by #(< (k %1) (k %2)))
      coll))
#'user/top-by
=> (top-by 3 first [[1 2] [10 2] [9 3] [4 2] [5 6]])
#{[5 6] [9 3] [10 2]}

DSL syntax with optional parameters

I'm trying to handle the following DSL:
(simple-query
 (is :category "car/audi/80")
 (is :price 15000))
that went quite smoothly, so I added one more thing: options passed to the query:
(simple-query {:page 1 :limit 100}
              (is :category "car/audi/80")
              (is :price 15000))
and now I have the problem of how to handle this case in the most civilized way. as you can see, simple-query may get a hash-map as its first element (followed by a long list of criteria) or may have no options map at all. moreover, I would like defaults to provide the default set of options whenever some (or all) of them are not given explicitly in the query.
this is what I figured out:
(def ^{:dynamic true} *defaults* {:page 1
                                  :limit 50})

(defn simple-query [& body]
  (let [opts (first body)
        [params criteria] (if (map? opts)
                            [(merge *defaults* opts) (rest body)]
                            [*defaults* body])]
    (execute-query params criteria)))
I feel it's kind of messy. any idea how to simplify this construction?
To solve this problem in my own code, I have a handy function I'd like you to meet... take-when.
user> (defn take-when [pred [x & more :as fail]]
        (if (pred x) [x more] [nil fail]))
#'user/take-when
user> (take-when map? [{:foo :bar} 1 2 3])
[{:foo :bar} (1 2 3)]
user> (take-when map? [1 2 3])
[nil [1 2 3]]
So we can use this to implement a parser for your optional map first argument...
user> (defn maybe-first-map [& args]
        (let [defaults {:foo :bar}
              [maybe-map args] (take-when map? args)
              options (merge defaults maybe-map)]
          ... ;; do work
          ))
So as far as I'm concerned, your proposed solution is more or less spot on. I would just clean it up by factoring out a parser for grabbing the options map (here, into my take-when helper) and by factoring out the merging of defaults into its own binding statement.
As a general matter, using a dynamic var to store configuration is an antipattern, due to potential misbehavior when code is evaluated lazily.
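A minimal sketch of that pitfall (my own illustration): a lazy seq built inside a binding can escape the scope and see the root value when finally realized.

(def ^:dynamic *limit* 50)

(defn query []
  (map (fn [x] {:limit *limit* :x x}) (range 2)))

(binding [*limit* 100] (query))
;;=> ({:limit 50, :x 0} {:limit 50, :x 1})   ; realized after binding exits

(binding [*limit* 100] (doall (query)))
;;=> ({:limit 100, :x 0} {:limit 100, :x 1}) ; forced inside the binding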
What about something like this?
(defn simple-query
  [& body]
  (if (map? (first body))
    (execute-query (merge *defaults* (first body)) (rest body))
    (execute-query *defaults* body)))

Clojure 2d list to hash-map

I have an infinite list like this:
((1 1)(3 9)(5 17)...)
I would like to make a hash map out of it:
{:1 1 :3 9 :5 17 ...}
Basically, the 1st element of each 'inner' list would become a keyword, and the second element its value. I am not sure whether it would be easier to do this at creation time; to create the list I use:
(iterate (fn [[a b]] [(computation for a) (computation for b)]) [1 1])
Computation of (b) requires (a) so I believe at this point (a) could not be a keyword... The whole point of that is so one can easily access a value (b) given (a).
Any ideas would be greatly appreciated...
--EDIT--
Ok so I figured it out:
(def my-map (into {} (map #(hash-map (keyword (str (first %))) (first (rest %))) my-list)))
The problem is: it does not seem to be lazy... it just goes forever even though I haven't consumed it. Is there a way to force it to be lazy?
The problem is that hash-maps can be neither infinite nor lazy. They are designed for fast key-value access. So, if you have a hash-map, you'll be able to perform fast key look-ups. Key-value access is the core idea of hash-maps, but it also makes the creation of a lazy, infinite hash-map impossible.
Given your 2d list, you can just use into to create a hash-map:
(into {} (vec (map vec my-list)))
But there is no way to make this hash-map infinite. So, the only solution for you is to create your own hash-map, like Chouser suggested. In this case you'll have an infinite 2d sequence and a function to perform lazy key lookup in it.
Actually, his solution can be slightly improved:
(def my-map (atom {}))
(def my-seq (atom (partition 2 (range))))

(defn build-map [stop]
  (when-let [[k v] (first @my-seq)]
    (swap! my-seq rest)
    (swap! my-map #(assoc % k v))
    (if (= k stop)
      v
      (recur stop))))

(defn get-val [k]
  (if-let [v (@my-map k)]
    v
    (build-map k)))
my-map in my example stores the current hash-map, and my-seq stores the sequence of not-yet-processed elements. The get-val function performs a lazy look-up, using the already-processed elements in my-map to improve its performance:
(get-val 4)
=> 5
@my-map
=> {4 5, 2 3, 0 1}
And a speed-up:
(time (get-val 1000))
=> Elapsed time: 7.592444 msecs
(time (get-val 1000))
=> Elapsed time: 0.048192 msecs
In order to be lazy, the computer will have to do a linear scan of the input sequence each time a key is requested, at the very least if the key is beyond what has been scanned so far. A naive solution is just to scan the sequence every time, like this:
(defn get-val [coll k]
  (some (fn [[a b]] (when (= k a) b)) coll))

(get-val '((1 1) (3 9) (5 17))
         3)
;=> 9
A slightly less naive solution would be to use memoize to cache the results of get-val, though this would still scan the input sequence more than strictly necessary. A more aggressively caching solution would be to use an atom (as memoize does internally) to cache each pair as it is seen, thereby only consuming more of the input sequence when a lookup requires something not yet seen.
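A sketch of that aggressive-caching idea (the make-get-val name is mine; single-threaded illustration only, since the two swap! calls are not coordinated):

(defn make-get-val [coll]
  (let [seen (atom {})    ; pairs consumed so far
        todo (atom coll)] ; unconsumed tail of the input
    (fn [k]
      (or (@seen k)       ; cached? (nil/false values would trigger a rescan)
          (loop []
            (when-let [[a b] (first @todo)]
              (swap! todo rest)
              (swap! seen assoc a b)
              (if (= a k) b (recur))))))))

(def lookup (make-get-val '((1 1) (3 9) (5 17))))
(lookup 3) ;=> 9, consuming the input only as far as (3 9)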
Regardless, I would not recommend wrapping this up in a hash-map API, as that would imply efficient immutable "updates" that would likely not be needed and yet would be difficult to implement. I would also generally not recommend keywordizing the keys.
If you flatten it down to a list of (k v k v k v k v) with flatten, then you can use apply to call hash-map with that list as its arguments, which will get you the map you seek.
user> (apply hash-map (flatten '((1 1)(3 9)(5 17))))
{1 1, 3 9, 5 17}
though it does not keywordize the keys.
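If keywordized keys are wanted, a small variant (my own sketch):

user> (into {}
            (map (fn [[k v]] [(keyword (str k)) v])
                 '((1 1) (3 9) (5 17))))
{:1 1, :3 9, :5 17}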
At least in Clojure, the last value associated with a key is taken to be the value for that key. If this were not the case, you couldn't produce a new map with a different value for a key that is already in the map, because the first (and now shadowed) key would be returned by the lookup function. And if the lookup function searches to the end, then it is not lazy. You could solve this by writing your own map implementation that uses association lists, though it would lack the performance guarantees of Clojure's trie-based maps, because it would devolve to linear time in the worst case.
I'm not sure keeping the input sequence lazy will have the desired result.
To make a hashmap from your sequence you could try:
(defn to-map [s] (zipmap (map (comp keyword str first) s) (map second s)))
=> (to-map '((1 1)(3 9)(5 17)))
=> {:5 17, :3 9, :1 1}
You can convert that structure to a hash-map later, this way:
(def it #(iterate (fn [[a b]] [(+ a 1) (+ b 1)]) [1 1]))
(apply hash-map (apply concat (take 3 (it))))
=> {1 1, 2 2, 3 3}